server: avoid unnecessary checkpoint invalidation for recurrent / hybrid models #24035
Regrad wants to merge 1 commit
Conversation
This looks like an interesting and simple fix. @ggerganov @ngxson, what do you think?
I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting content into earlier messages, which is inefficient and which I don't think we need to try to support. As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).
I'm using a Ryzen AI Max+ 395 with Qwen 3.6 27B, as well as Qwen 3.5 122B. My cache is constantly being flushed, and the prompt is re-processed from scratch every time. This fix has resolved the issue. I've tested it in LM Studio on Vulkan (AMD Radeon 8060S).
Log:
The comment section of that exact PR is filling up without much hope of a resolution. This fix seems rather simple by comparison, and the likelihood of it breaking things is nowhere near as high as that PR makes it out to be.
Summary
Improve prompt checkpoint reuse for recurrent and hybrid models in llama-server.
For these models, the memory position range stored in a checkpoint does not always map cleanly to the reusable prompt prefix length. As a result, valid checkpoints could be discarded during follow-up requests, causing unnecessary prompt re-processing even when the new request still shares a reusable prefix with the previous one.
What changed
This PR stores an additional pos_end value in each prompt checkpoint. It represents the end position of the prompt at the time the checkpoint was created.
When selecting a checkpoint for a new request:
pos_max;
When pruning checkpoints, a checkpoint is now removed if its saved prompt end exceeds the end of the new prompt, or if its memory range exceeds the current pos_next.
Expected effect
This avoids invalidating reusable checkpoints too aggressively for recurrent and hybrid models.
The expected result is less repeated prompt ingestion on follow-up requests with shared context, especially for long prompts and workloads where the same conversation prefix is reused across multiple requests.
Scope
The change is limited to prompt checkpoint bookkeeping and checkpoint selection in llama-server.
It does not change model inference logic or the existing checkpoint selection behavior for non-recurrent models.
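For illustration only, the pruning rule described under "What changed" could be sketched roughly as below. This is not the code from the PR: the struct `prompt_checkpoint`, the helper `prune_checkpoints`, and the parameter `new_prompt_end` are hypothetical names, and reading "its memory range exceeds the current pos_next" as `pos_max > pos_next` is an assumption.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint record (assumed layout, not the actual llama-server struct).
struct prompt_checkpoint {
    int32_t pos_min; // start of the memory position range covered by the checkpoint
    int32_t pos_max; // end of the memory position range covered by the checkpoint
    int32_t pos_end; // end position of the prompt when the checkpoint was created
};

// Drop checkpoints that can no longer be reused for the new request.
// A checkpoint is removed if its saved prompt end lies past the end of the
// new prompt, or if its memory range reaches past the current pos_next
// (assumed here to mean pos_max > pos_next).
inline void prune_checkpoints(std::vector<prompt_checkpoint> & checkpoints,
                              int32_t new_prompt_end,
                              int32_t pos_next) {
    checkpoints.erase(
        std::remove_if(checkpoints.begin(), checkpoints.end(),
            [&](const prompt_checkpoint & cp) {
                return cp.pos_end > new_prompt_end || cp.pos_max > pos_next;
            }),
        checkpoints.end());
}
```

Under these assumptions, a checkpoint whose `pos_end` and `pos_max` both fall within the new request survives pruning, so a follow-up request that shares the prefix can restore it instead of re-processing the prompt.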