server: avoid unnecessary checkpoint invalidation for recurrent / hybrid models #24035

Draft
Regrad wants to merge 1 commit into ggml-org:master from Regrad:fix/qwen-hybrid-checkpoint-reuse

Regrad commented Jun 2, 2026

Summary

Improve prompt checkpoint reuse for recurrent and hybrid models in llama-server.

For these models, the memory position range stored in a checkpoint does not always map cleanly to the reusable prompt prefix length. As a result, valid checkpoints could be discarded during follow-up requests, causing unnecessary prompt re-processing even when the new request still shares a reusable prefix with the previous one.

What changed

This PR stores an additional pos_end value in each prompt checkpoint. It represents the end position of the prompt at the time the checkpoint was created.
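As a rough illustration of the extra bookkeeping, the sketch below adds a `pos_end` field to a checkpoint record. The type `checkpoint_sketch` and the helper `make_checkpoint` are hypothetical names used only for this example, not the actual llama-server types, and treating `pos_end` as an exclusive end (the number of prompt tokens at creation time) is an assumption.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a prompt checkpoint; not the real llama-server struct.
struct checkpoint_sketch {
    int32_t pos_min;           // smallest memory position covered by the saved state
    int32_t pos_max;           // largest memory position covered by the saved state
    int32_t pos_end;           // assumed: prompt end (token count) when the checkpoint was created
    std::vector<uint8_t> data; // opaque serialized state
};

// Build a checkpoint for a prompt that currently holds n_prompt_tokens tokens.
inline checkpoint_sketch make_checkpoint(int32_t pos_min, int32_t pos_max,
                                         int32_t n_prompt_tokens,
                                         std::vector<uint8_t> data) {
    return { pos_min, pos_max, n_prompt_tokens, std::move(data) };
}
```

For a recurrent model, `pos_min` and `pos_max` may coincide (as in the log further down), which is why a separate record of the prompt end could be useful.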

When selecting a checkpoint for a new request:

  • checkpoints extending beyond the end of the new prompt are rejected;
  • recurrent and hybrid models use a dedicated reuse condition based on pos_max;
  • non-recurrent models keep the existing SWA-based condition.
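One possible reading of the three selection rules above is sketched below. The PR text does not spell out the exact comparisons, so both conditions are assumptions: the recurrent/hybrid branch accepts a checkpoint whose `pos_max` lies inside the shared prefix (`n_past`), and the non-recurrent branch uses a `pos_min` threshold derived from `n_swa`. All names (`checkpoint_sketch`, `select_checkpoint`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint record; only the position fields matter here.
struct checkpoint_sketch {
    int32_t pos_min;
    int32_t pos_max;
    int32_t pos_end; // assumed: prompt token count when the checkpoint was created
};

// Return the index of the most recent usable checkpoint, or -1 if none qualifies.
// n_past   : length of the prefix shared with the previous prompt
// n_prompt : number of tokens in the new prompt
int select_checkpoint(const std::vector<checkpoint_sketch> & cps,
                      int32_t n_past, int32_t n_prompt,
                      int32_t n_swa, bool is_recurrent) {
    assert(n_past <= n_prompt); // the shared prefix cannot be longer than the new prompt

    // Assumed SWA-style threshold for non-recurrent models.
    const int32_t pos_min_thold = std::max<int32_t>(0, n_past - n_swa);

    // Walk from newest to oldest so the longest reusable state wins.
    for (int i = static_cast<int>(cps.size()) - 1; i >= 0; --i) {
        const checkpoint_sketch & cur = cps[i];
        if (cur.pos_end > n_prompt) {
            continue; // checkpoint extends beyond the end of the new prompt
        }
        const bool ok = is_recurrent
            ? cur.pos_max < n_past          // assumed recurrent/hybrid condition
            : cur.pos_min < pos_min_thold;  // assumed SWA-based condition
        if (ok) {
            return i;
        }
    }
    return -1;
}
```

With checkpoints at positions 8191, 18818 and 154657 and a new prompt of 20010 tokens sharing a 20000-token prefix, this sketch would pick the checkpoint at 18818 rather than falling back to full re-processing.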

When pruning checkpoints, a checkpoint is now removed if its saved prompt end exceeds the end of the new prompt, or if its memory range exceeds the current pos_next.
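The pruning rule could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: "memory range exceeds `pos_next`" is interpreted as `pos_max > pos_next`, and `checkpoint_sketch` and `prune_checkpoints` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint record; only the position fields matter here.
struct checkpoint_sketch {
    int32_t pos_min;
    int32_t pos_max;
    int32_t pos_end; // assumed: prompt token count when the checkpoint was created
};

// Remove checkpoints that can no longer be valid for the new prompt.
// Returns the number of checkpoints erased.
std::size_t prune_checkpoints(std::vector<checkpoint_sketch> & cps,
                              int32_t n_prompt, int32_t pos_next) {
    const std::size_t before = cps.size();
    cps.erase(std::remove_if(cps.begin(), cps.end(),
                  [&](const checkpoint_sketch & cur) {
                      return cur.pos_end > n_prompt   // saved prompt end is past the new prompt
                          || cur.pos_max > pos_next;  // assumed: memory range is past pos_next
                  }),
              cps.end());
    return before - cps.size();
}
```

Note that with `pos_next = 0`, as in the "erased invalidated context checkpoint" log lines below, every checkpoint with a positive `pos_max` would be erased under this reading.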

Expected effect

This avoids invalidating reusable checkpoints too aggressively for recurrent and hybrid models.

The expected result is less repeated prompt ingestion on follow-up requests with shared context, especially for long prompts and workloads where the same conversation prefix is reused across multiple requests.

Scope

The change is limited to prompt checkpoint bookkeeping and checkpoint selection in llama-server.

It does not change model inference logic or the existing checkpoint selection behavior for non-recurrent models.

Regrad requested review from a team as code owners June 2, 2026 16:52
Regrad closed this Jun 2, 2026
Regrad reopened this Jun 2, 2026

pwilkin (Member) commented Jun 3, 2026

This looks like an interesting and simple fix, @ggerganov @ngxson what do you guys think?

ggerganov (Member) commented

I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages, which is inefficient, but I don't think we need to try to support that.

As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

Regrad (Author) commented Jun 3, 2026

> I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages, which is inefficient, but I don't think we need to try to support that.
>
> As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

I'm using a Ryzen AI Max+ 395 with Qwen 3.6 27B and Qwen 3.5 122B. My cache is constantly being flushed, and the prompt is re-processed from scratch every time. This fix has resolved the issue. I've tested it in LM Studio on Vulkan (AMD Radeon 8060S).

Regrad (Author) commented Jun 3, 2026

Log:

18: slot update_slots: id  3 | task 17322 | new prompt, n_ctx_slot = 262144, n_keep = 15, task.n_tokens = 15
19: slot update_slots: id  3 | task 17322 | cache reuse is not supported - ignoring n_cache_reuse = 256
20: slot update_slots: id  3 | task 17322 | n_past = 15, slot.prompt.tokens.size() = 22, seq_id = 3, pos_min = 21, n_swa = 1
21: slot update_slots: id  3 | task 17322 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
22: slot update_slots: id  3 | task 17322 | n_tokens = 0, memory_seq_rm [0, end)
23: slot update_slots: id  3 | task 17322 | prompt processing progress, n_tokens = 11, batch.n_tokens = 12, progress = 0.733333
24: [2026-04-02 11:55:04][INFO][qwen3.5-122b-a10b@?] Prompt processing progress: 0.0%
192407: slot update_slots: id  0 | task 97964 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
192408: slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 154657, pos_max = 154657, n_tokens = 154658, n_swa = 0, pos_next = 0, size = 62.813 MiB)
192409: [2026-05-02 23:14:58][DEBUG] slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 154767, pos_max = 154767, n_tokens = 154768, n_swa = 0, pos_next = 0, size = 62.813 MiB)
192410: [2026-05-02 23:14:58][DEBUG] slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 155123, pos_max = 155123, n_tokens = 155124, n_swa = 0, pos_next = 0, size = 62.813 MiB)

194477: [2026-05-02 23:23:01][DEBUG] slot update_slots: id  0 | task 98398 | restored context checkpoint (pos_min = 18818, pos_max = 18818, n_tokens = 18819, n_past = 18819, size = 62.813 MiB)
195051: [2026-05-02 23:23:30][DEBUG] slot update_slots: id  0 | task 98572 | restored context checkpoint (pos_min = 8191, pos_max = 8191, n_tokens = 8192, n_past = 8192, size = 62.813 MiB)

nssatlantis commented Jun 3, 2026

> I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages, which is inefficient, but I don't think we need to try to support that.
>
> As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

That exact PR seems to be filling up with comments that don't offer much hope. This fix seems rather simple by comparison, and the likelihood of it breaking things isn't nearly as high as it is turning out to be for that PR.

Regrad marked this pull request as draft June 4, 2026 09:19