server : fix prompt-cache reuse for hybrid/recurrent models by bjahoor · Pull Request #23121 · ggml-org/llama.cpp

bjahoor · 2026-05-15T21:21:48Z

My Nemotron model was super slow on every message because the server kept reprocessing the whole conversation. Found out this is a known bug on hybrid Mamba+attention models — caching is broken upstream.

Someone named Tongas fixed this on his fork a while ago (spiritbuun/buun-llama-cpp#26), and Alexey (sanmai) combined that with another change from the closed #22534 into a single commit (sanmai/llama.cpp@e0b8388). Both of their PRs were against forks and never landed upstream.

I just got their patches working on the current master code. Had to rename two variables (model → model_tgt, ctx_seq_rm_type → ctx_tgt_seq_rm_type) because the project moved stuff around since their patches were written. Diff is +18 / -5 in one file.

Tested on my Jetson AGX Xavier running Nemotron-Elastic-12B-A2B i1-Q5_K_M. Before the fix, every message reprocessed everything from scratch (the server logs "forcing full prompt re-processing due to lack of cache data"). After the fix, message 2 only processed the new tokens. Tongas had the same result on Qwen3-Next variants on his RTX 3090: 11 seconds down to 115 milliseconds.

This affects anyone running hybrid models like Nemotron, Jamba, or Qwen3.5+. Related reports: #22384, #16416, #14625, #19794, #20225.

Happy to add a test if useful. Can also walk through the code or run more reproductions.

AI usage: used Claude to help me find which closed PRs and forks had the relevant patches and to draft a first version of this description, which the bot caught and I rewrote by hand. Code is human work I rebased. I can defend every line without AI.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. Claude helped me find the relevant prior closed PRs and forks. The first draft of this description was AI-assisted (which the automated checker correctly flagged), so I rewrote it in my own words. The code itself is human-authored prior work by @Tongas and @sanmai that I rebased onto current master.

ggml-gh-bot · 2026-05-15T21:26:10Z

Hi @bjahoor, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

bjahoor · 2026-05-15T21:52:56Z

Description rewritten in my own words. The original draft was AI-assisted (which I disclosed at the bottom of the original); the bot flag was fair. The code itself is unchanged; same diff, same commit. Happy to walk through any of it.

sanmai · 2026-05-15T22:02:55Z

-                    do_checkpoint = do_checkpoint && (pos_min >= 0 && slot.prompt.n_tokens() >= 64);
+                    // for hybrid/recurrent models, lower the checkpoint threshold so short prompts also get checkpointed
+                    const int checkpoint_min_tokens = (llama_model_is_recurrent(model_tgt) || llama_model_is_hybrid(model_tgt)) ? 4 : 64;
+                    do_checkpoint = do_checkpoint && (pos_min >= 0 && slot.prompt.n_tokens() >= checkpoint_min_tokens);


We definitely should not change the floor to 4 tokens. It is in my commit but it was merely a PoC so please do not merge it without due understanding.

There was another PR by someone where this part was rejected and for a good reason.

bjahoor · 2026-05-15

Thanks, dropped it. Re-tested with a turn-1 prompt over the 64-token floor:

Turn 1 (86 tokens, cold): checkpoint created at pos 81
Turn 2 (107 tokens, extension): restored checkpoint, only 25 new tokens prefilled. cache_n=82.

The predicate fix and the seq_rm graceful failure both still work; the 64-token floor on checkpoint creation is unchanged.
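For readers following along, the seq_rm graceful-failure idea can be sketched as a toy model. This is not the code in server-context.cpp: `cache_plan`, `plan_after_seq_rm` and `n_to_prefill` are hypothetical names made up for illustration, and the real server tracks much more state. The sketch only shows the decision being discussed: if the partial removal fails but a checkpoint was just restored, keep the restored state rather than clearing the slot.

```cpp
#include <cassert>

// Toy model only: these types and names are hypothetical, not the server's.
struct cache_plan {
    int  n_reuse;  // tokens kept from the cached state
    bool cleared;  // true if the slot's cache was wiped
};

// Decide what to keep after attempting a partial removal on the sequence.
//   seq_rm_ok           - whether the partial removal succeeded
//   checkpoint_restored - whether a checkpoint was restored just before
//   n_cached            - tokens covered by the cached/restored state
inline cache_plan plan_after_seq_rm(bool seq_rm_ok, bool checkpoint_restored, int n_cached) {
    if (seq_rm_ok || checkpoint_restored) {
        // keep the cached state; only the remaining tokens need prefill
        return { n_cached, false };
    }
    // nothing usable: fall back to full re-processing
    return { 0, true };
}

// Number of prompt tokens that still have to be processed.
inline int n_to_prefill(int n_prompt, const cache_plan & plan) {
    assert(plan.n_reuse >= 0 && plan.n_reuse <= n_prompt);
    return n_prompt - plan.n_reuse;
}
```

With the numbers from the re-test above (107-token prompt, 82 tokens cached), keeping the restored state leaves 25 tokens to prefill, while clearing the slot would mean processing all 107 again.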

Three changes to tools/server/server-context.cpp restore working prompt-cache reuse on hybrid Mamba+attention architectures (Nemotron-H, Jamba, Qwen3.5/3.6/Next, Granite-H, Falcon-H1), which currently force full re-processing on every conversation turn.

1. Checkpoint search predicate: for hybrid/recurrent models pos_min always equals the sequence length, so the SWA-based check never matches. Use pos_max <= pos_next instead.
2. seq_rm failure handling: when partial seq_rm fails after a checkpoint was restored, keep the cached state instead of clearing the slot.
3. Checkpoint creation threshold: lower from 64 to 4 tokens for hybrid/recurrent models so short prompts can also be cached.

Tested on Qwen3.6-27B (RTX 3090, original work by Tongas) and on Nemotron-Elastic-12B-A2B (Jetson AGX Xavier sm_72). Based on prior work: see PR description for full attribution.
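The checkpoint search predicate change can be illustrated with a small self-contained sketch. Everything here is an assumption for illustration: the `checkpoint` struct, `find_checkpoint`, and the modelling of the SWA-based check as `pos_min < pos_min_thold` are simplifications, not the server's actual code. It only shows why a check keyed on pos_min finds nothing when pos_min sits at the end of the sequence, and how `pos_max <= pos_next` picks the newest checkpoint that does not run past the reusable prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified checkpoint record.
struct checkpoint {
    int pos_min;
    int pos_max;
};

// Search from newest to oldest and return the index of the first usable
// checkpoint, or -1 if none qualifies.
//   pos_next            - next position to be processed after the shared prefix
//   pos_min_thold       - threshold used by the (assumed) SWA-style check
//   recurrent_or_hybrid - selects which predicate is applied
inline int find_checkpoint(const std::vector<checkpoint> & cps,
                           int pos_next, int pos_min_thold, bool recurrent_or_hybrid) {
    assert(pos_next >= 0);
    for (std::size_t i = cps.size(); i-- > 0; ) {
        const bool usable = recurrent_or_hybrid
            ? cps[i].pos_max <= pos_next      // state must not extend past the prefix
            : cps[i].pos_min < pos_min_thold; // SWA-style check (simplified)
        if (usable) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For example, with a single checkpoint whose pos_min and pos_max are both 81, the simplified SWA-style check with a threshold of 50 finds nothing, while the pos_max check with pos_next = 82 selects it.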

bjahoor · 2026-05-16T18:36:46Z

Closing this — the underlying issue was resolved upstream by #22673 (MTP Support, commit 255582687), which added per-token snapshot support enabling partial seq_rm on recurrent/hybrid memory.

I retested Nemotron-H on clean master at 0253fb21f and prompt caching works correctly (turn 2 reused 82 cached tokens via checkpoint restore).

Thanks @sanmai for the earlier feedback.

bjahoor requested a review from a team as a code owner May 15, 2026 21:21

github-actions Bot added examples server labels May 15, 2026

sanmai reviewed May 15, 2026

bjahoor force-pushed the fix/hybrid-cache-restore branch from 5f649dc to 3ce9084 Compare May 15, 2026 22:13

bjahoor closed this May 16, 2026

bjahoor wants to merge 1 commit into ggml-org:master from bjahoor:fix/hybrid-cache-restore
