server : fix prompt-cache reuse for hybrid/recurrent models #23121
bjahoor wants to merge 1 commit into
Hi @bjahoor, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Description rewritten in my own words. The original draft was AI-assisted (which I disclosed at the bottom of the original); the bot flag was fair. The code itself is unchanged; same diff, same commit. Happy to walk through any of it.
- do_checkpoint = do_checkpoint && (pos_min >= 0 && slot.prompt.n_tokens() >= 64);
+ // for hybrid/recurrent models, lower the checkpoint threshold so short prompts also get checkpointed
+ const int checkpoint_min_tokens = (llama_model_is_recurrent(model_tgt) || llama_model_is_hybrid(model_tgt)) ? 4 : 64;
+ do_checkpoint = do_checkpoint && (pos_min >= 0 && slot.prompt.n_tokens() >= checkpoint_min_tokens);
We definitely should not change the floor to 4 tokens. It is in my commit but it was merely a PoC so please do not merge it without due understanding.
There was another PR by someone where this part was rejected and for a good reason.
Thanks, dropped it. Re-tested with a turn-1 prompt over the 64-token floor:
Turn 1 (86 tokens, cold): checkpoint created at pos 81
Turn 2 (107 tokens, extension): restored checkpoint, only 25 new tokens prefilled. cache_n=82.
The predicate fix and the seq_rm graceful failure both still work; the 64-token floor on checkpoint creation is unchanged.
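The numbers in the re-test above are consistent with a simple count, sketched here under one assumption: a checkpoint at 0-based position 81 covers 82 cached tokens. `tokens_cached` and `tokens_to_prefill` are hypothetical helpers written for illustration, not functions from server-context.cpp.

```cpp
// Hypothetical helper (not a llama.cpp function): number of tokens covered by
// a checkpoint whose last position is `checkpoint_pos` (0-based, assumed).
constexpr int tokens_cached(int checkpoint_pos) {
    return checkpoint_pos + 1; // positions 0..checkpoint_pos
}

// Hypothetical helper: tokens that still need prefill after the checkpoint
// is restored for a prompt of `n_prompt_tokens` tokens.
constexpr int tokens_to_prefill(int n_prompt_tokens, int checkpoint_pos) {
    return n_prompt_tokens - tokens_cached(checkpoint_pos);
}

// Figures from the re-test: checkpoint at pos 81, turn 2 has 107 tokens.
static_assert(tokens_cached(81) == 82, "matches the reported cache_n=82");
static_assert(tokens_to_prefill(107, 81) == 25, "matches the 25 prefilled tokens");
```

Both checks are evaluated at compile time, so the snippet only builds if the reported figures agree with each other.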
Three changes to tools/server/server-context.cpp restore working prompt-cache reuse on hybrid Mamba+attention architectures (Nemotron-H, Jamba, Qwen3.5/3.6/Next, Granite-H, Falcon-H1) which currently force full re-processing on every conversation turn.

1. Checkpoint search predicate: for hybrid/recurrent models pos_min always equals the sequence length, so the SWA-based check never matches. Use pos_max <= pos_next instead.
2. seq_rm failure handling: when partial seq_rm fails after a checkpoint was restored, keep the cached state instead of clearing the slot.
3. Checkpoint creation threshold: lower from 64 to 4 tokens for hybrid/recurrent models so short prompts can also be cached.

Tested on Qwen3.6-27B (RTX 3090, original work by Tongas) and on Nemotron-Elastic-12B-A2B (Jetson AGX Xavier sm_72).

Based on prior work: see PR description for full attribution.
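A minimal sketch of changes 1 and 2 from the commit message, for readers who want the logic in isolation. Everything here is an assumption made for illustration: the `checkpoint` struct, `find_checkpoint`, `cached_after_seq_rm`, and the parameter names are hypothetical and are not the types or functions used in server-context.cpp. `pos_next` is taken to be the position where processing of the new turn would resume, which the commit message does not define.

```cpp
#include <vector>

// Hypothetical stand-in for a saved context checkpoint. Field names follow
// the commit message; the struct is illustrative, not the server's real type.
struct checkpoint {
    int pos_min; // smallest position covered by the saved state
    int pos_max; // largest position covered by the saved state
};

// Change 1 (sketch): search newest-first for a checkpoint whose state does
// not extend past pos_next. Returns nullptr when no checkpoint qualifies.
static const checkpoint * find_checkpoint(const std::vector<checkpoint> & cps, int pos_next) {
    for (auto it = cps.rbegin(); it != cps.rend(); ++it) {
        if (it->pos_max <= pos_next) {
            return &*it;
        }
    }
    return nullptr;
}

// Change 2 (sketch): how many cached tokens survive a seq_rm attempt.
//   n_keep       : cached tokens the server wanted to keep (hypothetical name)
//   n_checkpoint : tokens held by the restored checkpoint, 0 if none was restored
static int cached_after_seq_rm(bool seq_rm_ok, int n_keep, int n_checkpoint) {
    if (seq_rm_ok) {
        return n_keep;       // partial removal worked, keep what was requested
    }
    if (n_checkpoint > 0) {
        return n_checkpoint; // removal failed, but the restored state is still usable
    }
    return 0;                // nothing to fall back on: clear the slot
}
```

With a single checkpoint `{81, 81}`, `find_checkpoint(cps, 82)` returns that checkpoint and `find_checkpoint(cps, 50)` returns `nullptr`. The `pos_min == pos_max` shape of the example data is itself an assumption about how recurrent state is recorded.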
5f649dc to 3ce9084
Closing this: the underlying issue was resolved upstream by #22673 (MTP Support, commit […]). I retested Nemotron-H on clean master at […]. Thanks @sanmai for the earlier feedback.
My Nemotron model was super slow on every message because the server kept reprocessing the whole conversation. Found out this is a known bug on hybrid Mamba+attention models — caching is broken upstream.
Someone named Tongas fixed this on his fork a while ago (spiritbuun/buun-llama-cpp#26), and Alexey (sanmai) combined that with another change from the closed #22534 into a single commit (sanmai/llama.cpp@e0b8388). Both their PRs were against forks, never landed upstream.
I just got their patches working on the current master code. Had to rename two variables (model→model_tgt, ctx_seq_rm_type→ctx_tgt_seq_rm_type) because the project moved stuff around since their patches were written. Diff is +18 / -5 in one file.
Tested on my Jetson AGX Xavier running Nemotron-Elastic-12B-A2B i1-Q5_K_M. Before the fix, every message reprocessed everything from scratch (the server logs "forcing full prompt re-processing due to lack of cache data"). After the fix, message 2 only processed the new tokens. Tongas had the same result on Qwen3-Next variants on his RTX 3090: 11 seconds down to 115 milliseconds.
This affects anyone running hybrid models like Nemotron, Jamba, or Qwen3.5+. Related reports: #22384, #16416, #14625, #19794, #20225.
Happy to add a test if useful. Can also walk through the code or run more reproductions.
AI usage: used Claude to help me find which closed PRs and forks had the relevant patches and to draft a first version of this description, which the bot caught and I rewrote by hand. Code is human work I rebased. I can defend every line without AI.