server : checkpoint before every user turn boundary by mfielding92 · Pull Request #23814 · ggml-org/llama.cpp

mfielding92 · 2026-05-28T11:50:44Z

Overview

#22929 creates a context checkpoint only before the last user message, so prompts with a stable prefix and content that changes between turns lose all checkpoint cache hits (the surviving checkpoints sit past the divergence point) and re-evaluate the full prompt every turn, notably on SWA models.

Create a checkpoint before every user message instead, derived from the same message_spans. prompt_get_n_before_user() becomes prompt_get_user_boundaries(); the prefill batch breaks at each boundary and a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints.

Additional information

Addresses the regression: #22929 (comment)

Originated from: 57f81cb
(Commit #22929 resolved some of these issues, but not all of them)

Checkpointing system working in testing: #22746 (comment)
(This PR uses the same technique, simply working alongside #22929 since it was originally designed before that commit)

Requirements

I have read and agree with the contributing guidelines: Yes
AI usage disclosure: Yes
Diagnostics, patch drafts and suggestions, automating labor
(example: running scripts and waiting for them to finish, recording results)
Human: Yes (of course!)
Code review and finalization, task management, quality assurance.
(I was consistently present, directing, and analyzing the code the entire time this patch was being worked on.)

ggml-org#22929 creates a context checkpoint only before the last user message, so prompts with a stable prefix and content that changes between turns lose all checkpoint cache hits (the surviving checkpoints sit past the divergence point) and re-evaluate the full prompt every turn, notably on SWA models. Create a checkpoint before every user message instead, derived from the same message_spans. prompt_get_n_before_user() becomes prompt_get_user_boundaries(); the prefill batch breaks at each boundary and a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints.

Adds a backstep-and-restore path that prevents `update_slots` from wiping the entire slot on every multi-turn request to QWEN35MOE / QWEN3NEXT (hybrid Gated DeltaNet + attention). The post-LCP cut is rounded back to the nearest chat-template turn boundary, the existing context-checkpoint infrastructure is reused to restore state when the recurrent backend rejects partial seq_rm, and a checkpoint is forced on every boundary token so a usable rescue point always exists. Drops per-turn cost from ~30s full re-eval to ~1s O(delta) on a 36k-token Qwen 3.6 27B test (~45x).

mfielding92 · 2026-05-28T21:09:14Z

@pwilkin @ggerganov added performance testing to PR, verification of no performance degradation. Let me know if this needs anything else to be ready for review.

pwilkin

TBH this looks good to me (minimal changes and pretty self-contained), would prob like @ngxson to take a look wrt. to the MTMD code.

ngxson · 2026-05-29T08:52:21Z

The mtmd code already exists from the previous version before this PR, so I think it won't change much. But I have to note that this part of the code is a bit over my head so I cannot tell for sure.

In anyway, since we now create checkpoint for both user and assistant message boundary, wouldn't it easier to simply create the checkpoint at every EOG tokens instead? Wouldn't detecting boundary via tokenizing become an overkill solution now?

aldehir · 2026-05-30T01:39:45Z

In anyway, since we now create checkpoint for both user and assistant message boundary, wouldn't it easier to simply create the checkpoint at every EOG tokens instead?

Hmm, I think you still want to keep the last user checkpoint at all times. So when reasoning gets striped from a long-horizon agentic task, it can rollback to the last user checkpoint instead of the start (since it would have been evicted by then). I think we probably need to cache checkpoints with different eviction rules and maintain those at certain boundaries.

ngxson · 2026-05-30T11:00:48Z

@aldehir not quite sure if I understand your comment correctly, let's take an example:

<turn>user
prompt1
<turn>assistant
response1
<turn>user
prompt2
<turn>assistant
reasoning_content
response2

so on master currently we have a checkpoint right after prompt2

on the next user message, prompt3, the reasoning_content will be removed and we rollback to prompt2 checkpoint, things look good.

in either cases, I still don't quite sure what this PR is trying to address. to add to that, @reedmayhew18 your AI-generated comment #23814 (comment) is not welcomed here as (1) we strictly forbidden AI-written comments and (2) your result doesn't even represent all possible cases that can happen. please do not post excessive AI-generated content, that is not helpful

mfielding92 · 2026-05-30T12:55:35Z

@aldehir not quite sure if I understand your comment correctly, let's take an example:
<turn>user
prompt1
<turn>assistant
response1
<turn>user
prompt2
<turn>assistant
reasoning_content
response2
so on master currently we have a checkpoint right after prompt2

on the next user message, prompt3, the reasoning_content will be removed and we rollback to prompt2 checkpoint, things look good.

in either cases, I still don't quite sure what this PR is trying to address. to add to that, @reedmayhew18 your AI-generated comment #23814 (comment) is not welcomed here as (1) we strictly forbidden AI-written comments and (2) your result doesn't even represent all possible cases that can happen. please do not post excessive AI-generated content, that is not helpful

@aldehir Hello, my apologies for using A.I. to assist with my responses. I have a hard time typing out long replies due to some neurological health issues I have, so it helps me make sure I have all of my thoughts laid out cleanly without any gaps. I can avoid using it moving forward if its an issue.

What this patch addresses:

Essentially, without this patch, my computer (and several others) ends up reprocessing tens of thousands of tokens on every turn (sometimes hundreds of thousands with vary long conversations) if checkpoints are not made on an intelligent basis.

As demonstrated by the testing, any subtle change in previous messages right now has the potential to cause a full prompt re-evaluation which is extremely inefficient and wasteful on time. I have 122b-190b param models loaded sometimes and they are unusable without the patch, as forcing them to re-evaluate a prompt can sometimes take up to 30 minutes, and if its running tools, force it to re-evaluate the entire prompt multiple times per turn. It's completely miserable and unbearable.

I use these models like 8+ hours a day (for personal and work) with Claude Code, Open-webui, Kai (mobile app), etc. And for example: Kai isn't the best programmed and does some timestamp adjustment in previous messages and causes the current master branch to re-evaluate the entire prompt multiple times per turn. With this patch, responses take seconds vs. several minutes with 100b+ models.

Hopefully that gives a better real-world example. I'm personally shocked that people aren't using the system without some sort of patch like the one I'm suggesting. Just because I genuinely thought my computer was too slow to handle these models until I spend several dozen hours working on this.

Lastly, please understand: I am fully aware A.I. was used to assist with this coding and if that's an issue, I'm more than happy to let someone else look at my suggestion and re-write it themselves in a human manner. Please understand that I am very knowledgeable about how these systems work and understand the workflow about these things, it is simply a limitation on my ability to physically type out the code from scratch. (I am in classes learning these languages, I just have very much difficulty typing/remembering syntax due to my neurological conditions.

Hopefully that makes sense. Not trying to cause any issues. Just trying to help. I promise I can answer any questions you have about the code, part of my current job is being an IT Security Administrator overseeing a fleet of computers that need constant diagnosis, fixing, all this stuff. Linux is my daily driver, I dont use windows at home, just at work. Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

Thanks and let me know if you have any questions!

ngxson · 2026-05-30T13:07:27Z

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders

let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

mfielding92 · 2026-05-30T13:11:52Z

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders

I genuinely can help with this stuff and others can't. That's a huge difference regardless of the size of text. Either way, I understand the importance of conciseness.

let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

I was responding to the code change as you replied. I was simply addressing both of your concerns separately. (Why I was using A.I. and then the code changes requested.)

Sorry about the wall of text, that is how I naturally talk, which is what you're requiring here. This is where I'd typically have something help me condense my thoughts down. I'll do my best to keep responses short.

Catch-Up to Current Master

ngxson · 2026-05-30T16:19:35Z

I might be a bit hard here, but the truth is that this PR isn't much different than other that are fully AI-generated

I don't believe patching that small prompt_get_n_before_user is an acceptable solution,but sadly that's what an AI will do if you ask it to. AI cannot look at the broader picture, unfortunately. The prompt_get_n_before_user itself is already a hacky solution and we don't want to extend from that, it will become extremely difficult to refactor it in the future.

It's best to open a discussion to see if the chat template system maintainers have a solution to somehow expose the boundary from the chat template engine back to the caller. I strongly believe it's feasible, and even if it's not, the logic to handle message boundaries should still eventually be in the chat.cpp, not inside server.

Closing this PR since both your current impl and the proposed one #23814 (comment) sounds too overkill

Xcc313r4n7 · 2026-06-03T03:04:40Z

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders

let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

What a dickish comment

Xcc313r4n7 · 2026-06-03T04:58:42Z

Gotta love the elitist gatekeepers that maintain this project

Xcc313r4n7 · 2026-06-03T05:02:59Z

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders
let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

What a dickish comment

It's probably the most insane thing I have seen on a project exclusively about creating AI and to be honest this person has the same kind of understanding some AI's do. Ever tried to get an AI to understand something it just can't understand because it is not in it's training data? That's the vibe I get from this guy. He just can't compute, His breain was stuck in a vibe coded loop.

This guy moderates at least 10 Reddit channels. I guarentee it! You simply can not be that much of a dick and not be a reddit mod.

That's an insult to AI though, maybe he should have used AI for HIS response, it would have been better than what he decided to go with

ddh0 · 2026-06-03T05:12:46Z

Hi @Xcc313r4n7 and @nihilistau. Thank you for your constructive feedback about how to moderate contributions to a project that neither of you have ever contributed to. If you understand this part of the code and are able to contribute a legitimate fix that you can explain in your own words, please feel free to submit a PR.

mfielding92 requested a review from a team as a code owner May 28, 2026 11:50

github-actions Bot added examples server labels May 28, 2026

pwilkin mentioned this pull request May 28, 2026

server: system prompt checkpoint #23808

Closed

pwilkin approved these changes May 29, 2026

View reviewed changes

ngxson requested changes May 30, 2026

View reviewed changes

Comment thread tools/server/server-context.cpp

Merge pull request #3 from ggml-org/master

d6f7a66

Catch-Up to Current Master

ngxson closed this May 30, 2026

This comment was marked as spam.

Sign in to view

ggml-org locked as off-topic and limited conversation to collaborators Jun 3, 2026

Conversation

mfielding92 commented May 28, 2026

Overview

Additional information

Requirements

Uh oh!

mfielding92 commented May 28, 2026

Uh oh!

pwilkin left a comment

Choose a reason for hiding this comment

Uh oh!

ngxson commented May 29, 2026

Uh oh!

aldehir commented May 30, 2026

Uh oh!

Uh oh!

ngxson commented May 30, 2026

Uh oh!

mfielding92 commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented May 30, 2026

Uh oh!

mfielding92 commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Xcc313r4n7 commented Jun 3, 2026

Uh oh!

This comment was marked as spam.

Xcc313r4n7 commented Jun 3, 2026

Uh oh!

Xcc313r4n7 commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as spam.

ddh0 commented Jun 3, 2026

Uh oh!

This comment was marked as spam.

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

mfielding92 commented May 30, 2026 •

edited

Loading

mfielding92 commented May 30, 2026 •

edited

Loading

ngxson commented May 30, 2026 •

edited

Loading

Xcc313r4n7 commented Jun 3, 2026 •

edited

Loading