server: enable checkpoint for recurrent models #1310
Conversation
Force-pushed 03c1862 to 8c298f8
Is the kv cache being deleted for every batch?
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140581237866496" timestamp=1771922359 id_slot=0 id_task=216 p0=0
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140581237866496" timestamp=1771922366 id_slot=0 id_task=216 p0=4096
The console says "Common part does not match fully", but the text of the cache and the prompt are identical. Even without deleting any message, the whole context is still being rebuilt from position 0. This is without interval, btw; with interval only a small portion is being rebuilt. That probably means the cache is the problem: my front end should be sending the exact same prompt every time.
Better check if your front end injects or changes anything in your prompt. If you see "Common part does not match fully", your prompt does not fully contain your cache. Check the log to see the difference.
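To illustrate what "common part" means here, a minimal sketch of a longest-common-prefix check over token ids (hypothetical helper, not the actual ik_llama.cpp server code): the server reuses the longest prefix of cached tokens that matches the new prompt, evicts the rest ("kv cache rm [p0, end)"), and resumes processing from that position.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: compare the new prompt's tokens against the tokens
// already in the slot's cache and return the length of the shared prefix.
// If the result is smaller than the cache size, the cached suffix must be
// dropped; if a single token differs early (e.g. an injected timestamp in
// the system prompt), the whole context gets rebuilt from that point.
static size_t common_prefix_len(const std::vector<int32_t>& cache,
                                const std::vector<int32_t>& prompt) {
    size_t n = 0;
    const size_t limit = std::min(cache.size(), prompt.size());
    while (n < limit && cache[n] == prompt[n]) {
        ++n;
    }
    return n;
}
```

A front end that rewrites even one early token forces the match length to that token's position, which is why an identical-looking prompt can still trigger a rebuild from 0.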
LGTM, but can we have 2-3 people confirm that it works for them?
String ban works again. |
Updated to save only the recurrent cache for the checkpoint. This reduces the checkpoint size at longer context sizes.
@firecoperana Could you try this: send a long prompt. Because I did this with b3cf43e and, as expected, the kv cache didn't need to be rebuilt. True, I did this with SillyTavern, and that program has a million settings, so missing one that changes the prompt isn't impossible. But I kept the same browser window open for both tests, same ST session; I only changed the ik PR. I tried this, btw, both with qwen3.5 and MinimaxM2.5, which of course isn't hybrid.
These two are thinking models. Did the old prompt get saved to RAM when it was rebuilt from 0? I need to see your detailed logs to understand what's going on. I tried with Qwen3, but it worked fine.
Fair, I've been lazy with the tests on this one. I'm using --reasoning-format none and handling thinking tags at the front-end level using a custom Jinja template. This might sound strange, but it's the only way to get every model to work in both thinking and non-thinking mode: thinking tags are treated as if they were normal text, and the Jinja template outputs everything unprocessed. Again, this works for every model on main, Qwen3.5 included, but not in this PR. But I don't think the thinking is the problem, because I have a 12000-token system prompt, and with non-hybrid models like minimax-m2.5, even if the thinking doesn't match, I would still expect those 12000 tokens not to be recalculated. I'll compare everything with -v properly, size of the .bin file included. I'll also test 68bd30d specifically, which, unless I'm wrong, is this PR's parent. This PR might not be the one that introduced the problem after all.
I checked out your branch, merged with the main branch at 0bf7043, and got a problem when running Qwen3 Coder Next.
Force-pushed 95555e2 to b82fe0f
create checkpoint after cancel
fix ban string and rm context during rewind
add checkpoint interval
only save recurrent cache
Force-pushed b82fe0f to 2338987
Yes, with a non-hybrid model, processing from 0 does not look right. I just synced to the latest main and removed some code that could cause different behavior. See if that fixed it.
If you continue to send new prompts, does it work after that? I've seen this happen with other models, so it may not be related to this PR. It could also indicate some issue with the kv cache, but I'm not familiar with it. @ikawrakow Do you know what is wrong?
Doesn't one need to take a snapshot after processing the system prompt but before any other tokens have been added? Otherwise the system prompt will always have to be re-processed, so people using long system prompts (like @MrHills-rs) will definitely notice.
Yes, I noticed that checkpoints are created during TG and not at the end of PP, at least when --ctx-checkpoints-interval is used; not sure when it is unset. This is also a problem when starting a conversation with a very long user message, for example when you want to input a long text into the AI and ask questions about it, or when loading a large old conversation. In both cases you're unable to swipe the first AI answer, which isn't optimal. Ideally one would create checkpoints both during TG and PP, so when loading a large old conversation one can actually delete a few messages without having to rebuild. Now, thanks to everyone, the kv cache is extremely efficient: 131k f16 is only around 4GB, plus half a GB for every checkpoint with qwen3.5. Very long contexts are going to be increasingly common even for consumers, so I think this is quite important.
Do you have an easy reproduction? |
No, I ran my agent loop, which just appends new messages to the context and sends them over. After this the output is total junk, and I got a context-leaking problem (my agent A got agent B's context, very weird). I suspect this is related to some save/load and slot-picking mechanism, because it happens when I run 2-3 loops in parallel.
It was deepseek v2.5. What fixed it for me back then was modifying the prompt slightly, after which it generated the response normally. I thought it could just be a poor model and didn't investigate further. @MrHills-rs The interval of checkpoints created during PP must be a multiple of the batch size, but it's better than no checkpoint at all.
@chulucninh09 Can you pull the latest PR, which has a fix for prompts sent in parallel? #1303 |
Yeah, I suspected that. Unfortunately this means that if you save a checkpoint before the last batch, you don't really know how many tokens before generation it includes; it can be anything between 1 and the batch size. Well, on a fresh first prompt you'll have no checkpoints anyway, so you might as well save all of them after processing each of the last --ctx-checkpoints batches. Even if you have to redo a whole batch, that's generally not very time-consuming, and it's far better than redoing a whole 100k+ tokens.
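A small sketch of the batch-alignment constraint being discussed (hypothetical helper, not the server's actual code): during PP a checkpoint can only land on a batch boundary, so a requested interval is effectively rounded up to the next multiple of the batch size.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical illustration: compute the positions at which checkpoints
// would be taken while processing n_prompt tokens in batches of n_batch,
// with `interval` acting as a lower bound between checkpoints. With
// n_batch = 2048 and interval = 3000, checkpoints effectively land on
// every second batch boundary.
static std::vector<int32_t> pp_checkpoint_positions(int32_t n_prompt,
                                                    int32_t n_batch,
                                                    int32_t interval) {
    std::vector<int32_t> pos;
    int32_t last = 0;
    for (int32_t p = n_batch; p <= n_prompt; p += n_batch) {
        if (p - last >= interval) {   // interval rounded up to the next
            pos.push_back(p);         // batch boundary
            last = p;
        }
    }
    return pos;
}
```

With a very small interval this degenerates to one checkpoint per batch, which matches the "save after each of the last batches" suggestion above.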
2338987 works. I can save and load caches from .bin, and the cache isn't rebuilt completely after every message, both for minimax and qwen3.5. I've had a little garbled output every once in a while, but that happened with main too, and it may very well be because of my really tight IQ2_XS quant. It gets really bad at long context, but no one expects a 2bpw model to do well at 40k ctx. Next week I'll have more RAM and I'll test IQ4_XS in depth. For now, everything looks fine.
@firecoperana it's still happening. For more information, this is my git log; I checked out yours and merged from main.
People have reported that batch processing does not currently work very well. Your issue may not be related to this PR.
* server: enable checkpoint for recurrent models
  create checkpoint after cancel
  fix ban string and rm context during rewind
  add checkpoint interval
  only save recurrent cache
* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>
This PR enables checkpoints for recurrent models without the need to port the recurrent-cache code from mainline. It always creates the checkpoint after the prompt is fully processed and the response ends.
--ctx-checkpoints: set the number of checkpoints per slot.
--ctx-checkpoints-interval: minimum number of tokens between each context checkpoint. If you want to create checkpoints more frequently, set it to a small value. If it's set to a positive number, checkpoints are saved during TG at this interval. During PP, a checkpoint can only be saved once per batch, so the interval becomes the minimum number of tokens between context checkpoints.
Use llm_arch_is_hybrid to replace QWEN3NEXT and QWEN35MOE when dealing with recurrent/hybrid models.
Fix the bug where ban strings do not work and the kv cache is not removed. @SneedwareInc
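As a usage sketch (the flag values here are arbitrary examples, not recommendations), the two options described above might be combined like this:

```
./llama-server -m model.gguf --ctx-checkpoints 8 --ctx-checkpoints-interval 4096
```

With a batch size of 2048 (`-b 2048`), PP checkpoints would then land at most once every second batch, since the interval is rounded up to the next batch boundary.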