server : fix prompt-cache reuse for hybrid/recurrent models #23121

Closed
bjahoor wants to merge 1 commit into ggml-org:master from bjahoor:fix/hybrid-cache-restore

@bjahoor bjahoor commented May 15, 2026

My Nemotron model was super slow on every message because the server kept reprocessing the whole conversation. Found out this is a known bug on hybrid Mamba+attention models — caching is broken upstream.

Someone named Tongas fixed this on his fork a while ago (spiritbuun/buun-llama-cpp#26), and Alexey (sanmai) combined that with another change from the closed #22534 into a single commit (sanmai/llama.cpp@e0b8388). Both their PRs were against forks, never landed upstream.

I just got their patches working on the current master code. Had to rename two variables (model → model_tgt, ctx_seq_rm_type → ctx_tgt_seq_rm_type) because the project moved stuff around since their patches were written. Diff is +18 / -5 in one file.

Tested on my Jetson AGX Xavier running Nemotron-Elastic-12B-A2B i1-Q5_K_M. Before the fix, every message reprocessed everything from scratch (the server logs "forcing full prompt re-processing due to lack of cache data"). After the fix, message 2 only processed the new tokens. Tongas had the same result on Qwen3-Next variants on his RTX 3090: 11 seconds down to 115 milliseconds.

This affects anyone running hybrid models like Nemotron, Jamba, or Qwen3.5+. Related reports: #22384, #16416, #14625, #19794, #20225.

Happy to add a test if useful. Can also walk through the code or run more reproductions.

AI usage: used Claude to help me find which closed PRs and forks had the relevant patches and to draft a first version of this description, which the bot caught and I rewrote by hand. Code is human work I rebased. I can defend every line without AI.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. Claude helped me find the relevant prior closed PRs and forks. The first draft of this description was AI-assisted (which the automated checker correctly flagged), so I rewrote it in my own words. The code itself is human-authored prior work by @Tongas and @sanmai that I rebased onto current master.

@bjahoor bjahoor requested a review from a team as a code owner May 15, 2026 21:21

ggml-gh-bot Bot commented May 15, 2026

Hi @bjahoor, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

bjahoor commented May 15, 2026

Description rewritten in my own words. The original draft was AI-assisted (which I disclosed at the bottom of the original); the bot flag was fair. The code itself is unchanged; same diff, same commit. Happy to walk through any of it.

Comment thread tools/server/server-context.cpp Outdated
- do_checkpoint = do_checkpoint && (pos_min >= 0 && slot.prompt.n_tokens() >= 64);
+ // for hybrid/recurrent models, lower the checkpoint threshold so short prompts also get checkpointed
+ const int checkpoint_min_tokens = (llama_model_is_recurrent(model_tgt) || llama_model_is_hybrid(model_tgt)) ? 4 : 64;
+ do_checkpoint = do_checkpoint && (pos_min >= 0 && slot.prompt.n_tokens() >= checkpoint_min_tokens);

Contributor

We definitely should not change the floor to 4 tokens. It is in my commit but it was merely a PoC so please do not merge it without due understanding.

Contributor

There was another PR by someone where this part was rejected and for a good reason.

Author

Thanks, dropped it. Re-tested with a turn-1 prompt over the 64-token floor:

Turn 1 (86 tokens, cold): checkpoint created at pos 81
Turn 2 (107 tokens, extension): restored checkpoint, only 25 new tokens prefilled. cache_n=82.

The predicate fix and the seq_rm graceful failure both still work; the 64-token floor on checkpoint creation is unchanged.

Three changes to tools/server/server-context.cpp restore working
prompt-cache reuse on hybrid Mamba+attention architectures
(Nemotron-H, Jamba, Qwen3.5/3.6/Next, Granite-H, Falcon-H1)
which currently force full re-processing on every conversation turn.

1. Checkpoint search predicate: for hybrid/recurrent models pos_min
   always equals the sequence length, so the SWA-based check never
   matches. Use pos_max <= pos_next instead.

2. seq_rm failure handling: when partial seq_rm fails after a
   checkpoint was restored, keep the cached state instead of
   clearing the slot.

3. Checkpoint creation threshold: lower from 64 to 4 tokens for
   hybrid/recurrent models so short prompts can also be cached.

Tested on Qwen3.6-27B (RTX 3090, original work by Tongas) and on
Nemotron-Elastic-12B-A2B (Jetson AGX Xavier sm_72).

Based on prior work: see PR description for full attribution.
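
The first two changes in the commit message above can be sketched in isolation. This is a minimal illustration, not the code from tools/server/server-context.cpp: the `Checkpoint` struct, `find_checkpoint`, `cached_after_seq_rm`, and the `pos_min_thold` form of the pre-existing check are all hypothetical stand-ins chosen for the example, and it assumes a C++17 compiler for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a saved context checkpoint. The real server
// struct also carries the serialized state; only positions matter here.
struct Checkpoint {
    int32_t pos_min; // smallest position covered by the saved state
    int32_t pos_max; // largest position covered by the saved state
};

// Change 1 (sketch): search the checkpoints newest-first for one that can be
// restored when processing must resume at pos_next.
// In the real server, recurrent_or_hybrid would be derived from
// llama_model_is_recurrent() / llama_model_is_hybrid().
std::optional<std::size_t> find_checkpoint(const std::vector<Checkpoint> & checkpoints,
                                           int32_t pos_next,
                                           int32_t pos_min_thold,
                                           bool recurrent_or_hybrid) {
    assert(pos_next >= 0);
    for (std::size_t i = checkpoints.size(); i-- > 0; ) {
        const Checkpoint & cur = checkpoints[i];
        const bool usable = recurrent_or_hybrid
            ? cur.pos_max <= pos_next       // saved state must not run past the resume point
            : cur.pos_min <  pos_min_thold; // pos_min-based window check (assumed form)
        if (usable) {
            return i;
        }
    }
    return std::nullopt;
}

// Change 2 (sketch): number of cached tokens to keep after a partial seq_rm.
// If the removal failed but a checkpoint was already restored, keep the
// restored state instead of clearing the slot.
int32_t cached_after_seq_rm(bool seq_rm_ok, bool checkpoint_restored, int32_t n_cached) {
    if (seq_rm_ok || checkpoint_restored) {
        return n_cached;
    }
    return 0; // nothing reusable is left: full prompt re-processing
}
```

With a recurrent-style checkpoint whose pos_min and pos_max are both 81 and a resume position of 82, the `pos_max <= pos_next` branch accepts the checkpoint, while a pos_min-based check with a lower threshold would skip it.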
@bjahoor bjahoor force-pushed the fix/hybrid-cache-restore branch from 5f649dc to 3ce9084 Compare May 15, 2026 22:13

bjahoor commented May 16, 2026

Closing this — the underlying issue was resolved upstream by #22673 (MTP Support, commit 255582687), which added per-token snapshot support enabling partial seq_rm on recurrent/hybrid memory.

I retested Nemotron-H on clean master at 0253fb21f and prompt caching works correctly (turn 2 reused 82 cached tokens via checkpoint restore).

Thanks @sanmai for the earlier feedback.

@bjahoor bjahoor closed this May 16, 2026