server: avoid unnecessary checkpoint invalidation for recurrent / hybrid models by Regrad · Pull Request #24035 · ggml-org/llama.cpp

Regrad · 2026-06-02T16:52:23Z

Summary

Improve prompt checkpoint reuse for recurrent and hybrid models in llama-server.

For these models, the memory position range stored in a checkpoint does not always map cleanly to the reusable prompt prefix length. As a result, valid checkpoints could be discarded during follow-up requests, causing unnecessary prompt re-processing even when the new request still shares a reusable prefix with the previous one.

What changed

This PR stores an additional pos_end value in each prompt checkpoint. It represents the end position of the prompt at the time the checkpoint was created.

When selecting a checkpoint for a new request:

checkpoints extending beyond the end of the new prompt are rejected;
recurrent and hybrid models use a dedicated reuse condition based on pos_max;
non-recurrent models keep the existing SWA-based condition.
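
The selection rules above can be sketched roughly as follows. This is an illustrative sketch, not the code from the PR: the struct, the function name, and the exact comparisons (`pos_max < n_past` for recurrent/hybrid models, `pos_min < max(0, n_past - n_swa)` for the others) are assumptions made for the example, and `n_prompt` stands in for "the end of the new prompt".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a prompt checkpoint record.
struct checkpoint_sketch {
    int32_t pos_min; // first memory position covered by the checkpoint
    int32_t pos_max; // last memory position covered by the checkpoint
    int32_t pos_end; // prompt end position when the checkpoint was created (new field)
};

// Returns the index of the most recent reusable checkpoint, or -1 if none qualifies.
//   n_past   - length of the prefix shared with the previous prompt (assumed meaning)
//   n_prompt - end of the new prompt (assumed to be its token count)
int select_checkpoint_sketch(const std::vector<checkpoint_sketch> & cps,
                             int32_t n_past, int32_t n_prompt, int32_t n_swa,
                             bool is_recurrent_or_hybrid) {
    for (int i = (int) cps.size() - 1; i >= 0; --i) {
        const checkpoint_sketch & cp = cps[i];
        assert(cp.pos_min <= cp.pos_max);

        // Rule 1: reject checkpoints that extend beyond the end of the new prompt.
        if (cp.pos_end > n_prompt) {
            continue;
        }

        // Rule 2/3: dedicated pos_max-based condition for recurrent/hybrid models,
        // SWA-based condition otherwise (both comparisons are assumptions).
        const bool reusable = is_recurrent_or_hybrid
            ? cp.pos_max < n_past
            : cp.pos_min < std::max<int32_t>(0, n_past - n_swa);

        if (reusable) {
            return i;
        }
    }
    return -1;
}
```

Under these assumptions, a recurrent model holding checkpoints at positions 8191 and 18818 would reuse the later one for a follow-up prompt that shares 20000 tokens, and fall back to the earlier one when only 10000 tokens are shared.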

When pruning checkpoints, a checkpoint is now removed if its saved prompt end exceeds the end of the new prompt, or if its memory range exceeds the current pos_next.
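
A minimal sketch of that pruning rule, under stated assumptions: the struct and function names are hypothetical, `n_prompt` stands in for "the end of the new prompt", and "memory range exceeds `pos_next`" is interpreted here as `pos_max > pos_next`. The actual comparisons in the PR may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a prompt checkpoint record.
struct checkpoint_entry {
    int32_t pos_min; // first memory position covered by the checkpoint
    int32_t pos_max; // last memory position covered by the checkpoint
    int32_t pos_end; // prompt end position when the checkpoint was created
};

// Erases checkpoints that can no longer be used and returns how many were erased.
size_t prune_checkpoints_sketch(std::vector<checkpoint_entry> & cps,
                                int32_t n_prompt, int32_t pos_next) {
    const size_t before = cps.size();

    cps.erase(std::remove_if(cps.begin(), cps.end(),
        [&](const checkpoint_entry & cp) {
            assert(cp.pos_min <= cp.pos_max);
            // Drop if the saved prompt end exceeds the end of the new prompt,
            // or if the memory range reaches past pos_next (assumed comparison).
            return cp.pos_end > n_prompt || cp.pos_max > pos_next;
        }), cps.end());

    return before - cps.size();
}
```

Note that with this interpretation a `pos_next` of 0 erases every checkpoint that covers a position greater than 0.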

Expected effect

This avoids invalidating reusable checkpoints too aggressively for recurrent and hybrid models.

The expected result is less repeated prompt ingestion on follow-up requests with shared context, especially for long prompts and workloads where the same conversation prefix is reused across multiple requests.

Scope

The change is limited to prompt checkpoint bookkeeping and checkpoint selection in llama-server.

It does not change model inference logic or the existing checkpoint selection behavior for non-recurrent models.

pwilkin · 2026-06-03T09:53:39Z

This looks like an interesting and simple fix, @ggerganov @ngxson what do you guys think?

ggerganov · 2026-06-03T10:41:57Z

I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages which is inefficient, but I don't think we need to try to support.

As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

Regrad · 2026-06-03T11:21:45Z

> I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages which is inefficient, but I don't think we need to try to support.
>
> As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

I'm using a Ryzen 395 Max+ with Qwen 3.6 27b and Qwen 3.5 122b. My cache is constantly being flushed, and the prompt is re-processed from scratch every time. This fix has resolved the issue. I tested it in LM Studio on Vulkan (AMD Radeon 8060S).

Regrad · 2026-06-03T11:30:20Z

Log:

18: slot update_slots: id  3 | task 17322 | new prompt, n_ctx_slot = 262144, n_keep = 15, task.n_tokens = 15
19: slot update_slots: id  3 | task 17322 | cache reuse is not supported - ignoring n_cache_reuse = 256
20: slot update_slots: id  3 | task 17322 | n_past = 15, slot.prompt.tokens.size() = 22, seq_id = 3, pos_min = 21, n_swa = 1
21: slot update_slots: id  3 | task 17322 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
22: slot update_slots: id  3 | task 17322 | n_tokens = 0, memory_seq_rm [0, end)
23: slot update_slots: id  3 | task 17322 | prompt processing progress, n_tokens = 11, batch.n_tokens = 12, progress = 0.733333
24: [2026-04-02 11:55:04][INFO][qwen3.5-122b-a10b@?] Prompt processing progress: 0.0%

192407: slot update_slots: id  0 | task 97964 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
192408: slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 154657, pos_max = 154657, n_tokens = 154658, n_swa = 0, pos_next = 0, size = 62.813 MiB)
192409: [2026-05-02 23:14:58][DEBUG] slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 154767, pos_max = 154767, n_tokens = 154768, n_swa = 0, pos_next = 0, size = 62.813 MiB)
192410: [2026-05-02 23:14:58][DEBUG] slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 155123, pos_max = 155123, n_tokens = 155124, n_swa = 0, pos_next = 0, size = 62.813 MiB)

194477: [2026-05-02 23:23:01][DEBUG] slot update_slots: id  0 | task 98398 | restored context checkpoint (pos_min = 18818, pos_max = 18818, n_tokens = 18819, n_past = 18819, size = 62.813 MiB)
195051: [2026-05-02 23:23:30][DEBUG] slot update_slots: id  0 | task 98572 | restored context checkpoint (pos_min = 8191, pos_max = 8191, n_tokens = 8192, n_past = 8192, size = 62.813 MiB)

nssatlantis · 2026-06-03T23:59:02Z

> I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages which is inefficient, but I don't think we need to try to support.
>
> As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

The comment section of that PR doesn't inspire much hope. This fix seems rather simple by comparison, and the likelihood of it breaking things is much lower than it appears to be for that PR.

server: improve checkpoint reuse heuristics for recurrent/hybrid models

fc6a7e0

Regrad requested review from a team as code owners June 2, 2026 16:52

github-actions Bot added examples server labels Jun 2, 2026

Regrad closed this Jun 2, 2026

Regrad reopened this Jun 2, 2026

Regrad marked this pull request as draft June 4, 2026 09:19

Regrad wants to merge 1 commit into ggml-org:master from Regrad:fix/qwen-hybrid-checkpoint-reuse


4 participants
