server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

jacekpoplawski · 2026-05-11T01:31:46Z

Overview

Implemented as requested in #22826 (comment)

extract message_spans from chat templates
use the autoparser to support more chat templates
find the prompt token position before the latest user message
split prompt batching at that position
create a context checkpoint before the latest user input
avoid periodic mid-prompt checkpoints when that position is known
handle multimodal prompts when mapping text/template positions to server prompt tokens
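
A rough sketch of the batch split described above, in Python rather than the server's C++. The function name `plan_batches` and its parameters are hypothetical and are not taken from the patch. It also ignores the other splits the server makes (the logs show, for example, a separate final 4-token batch).

```python
def plan_batches(n_cached, n_prompt, n_batch, boundary=None):
    """Return (start, end) token ranges to decode, in order.

    A batch is cut short when it would cross `boundary`, so the context
    state exactly at `boundary` can be saved as a checkpoint.
    All names here are illustrative assumptions.
    """
    batches = []
    pos = n_cached
    while pos < n_prompt:
        end = min(pos + n_batch, n_prompt)
        # Stop exactly before the latest user message if it falls inside this batch.
        if boundary is not None and pos < boundary < end:
            end = boundary
        batches.append((pos, end))
        pos = end
    return batches


# 3634 tokens already cached, 4085-token prompt, user boundary at 3636
print(plan_batches(3634, 4085, 8192, boundary=3636))
```

With a known boundary the first batch ends at 3636, which is where the checkpoint would be taken. Without one, the whole remainder goes out as a single batch.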

Additional information

This is another chapter in my journey toward fixing the "forcing full prompt re-processing due to lack of cache data" problem.

My main goal is to increase the "responsiveness" of agentic coding in llama.cpp

I am currently testing this with the following command:

./bin/llama-server \
  -c 200000 \
  -m /mnt/models2/Qwen/3.6/Qwen3.6-27B-Q8_0.gguf \
  --host 0.0.0.0 \
  --jinja \
  -fa on \
  --keep 4096 \
  -b 8192 \
  --parallel 1 \
  --ctx-checkpoints 24 \
  --cache-ram 65536 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0 \
  --presence-penalty 0 \
  --repeat-penalty 1.0 \
  --spec-type ngram-mod \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --checkpoint-min-step 256

preserve_thinking really helps. Without it, the prompt history changes between requests, so there is always some reprocessing.
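
To illustrate why a changing history hurts, here is a minimal sketch of prefix-based cache reuse. The function `reuse_stats` and the toy token strings are made up for illustration. The names `f_keep` and `sim` are borrowed from the server log, and the definitions below are only my reading of the logged numbers, not copied from the server source.

```python
def reuse_stats(cached, prompt):
    """Return (n_common, f_keep, sim) for two token sequences.

    n_common: length of the shared prefix
    f_keep:   fraction of the cached tokens that can be kept (assumed definition)
    sim:      fraction of the new prompt already covered (assumed definition)
    """
    n_common = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n_common += 1
    f_keep = n_common / len(cached) if cached else 0.0
    sim = n_common / len(prompt) if prompt else 0.0
    return n_common, f_keep, sim


# Toy "tokens": what the slot holds after the previous turn
cached = ["<user>", "hi", "<assistant>", "<think>", "plan", "</think>", "hello"]

# Next request with the thinking block preserved in the history
kept = cached + ["<user>", "more"]
# Next request where the template dropped the thinking block
stripped = ["<user>", "hi", "<assistant>", "hello", "<user>", "more"]

print(reuse_stats(cached, kept))
print(reuse_stats(cached, stripped))
```

When the thinking block is kept, the whole cached sequence is a prefix of the new prompt. When it is dropped, the sequences diverge right after the assistant header, and everything from that point on has to be processed again.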

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - code exploration/prototyping

jacekpoplawski · 2026-05-11T01:35:35Z

Tested the following way:

CUDA_VISIBLE_DEVICES=0,1,2 ./bin/llama-server -m /mnt/models2/Qwen/3.6/Qwen3.6-27B-UD-Q8_K_XL.gguf --host 0.0.0.0 --ctx-checkpoints 8 -b 8192 --spec-type ngram-mod --parallel 1 --top-p 0.95 --top-k 20 --min-p 0 --presence-penalty 0 --repeat-penalty 1.0


main: starting the main loop...
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=14764, token_pos=3549
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 3562
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3046, batch.n_tokens = 3046, progress = 0.855138
slot update_slots: id  0 | task 0 | n_tokens = 3046, memory_seq_rm [3046, end)
slot update_slots: id  0 | task 0 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 3549
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3549, batch.n_tokens = 503, progress = 0.996350
slot update_slots: id  0 | task 0 | skip checkpoint at 3046, expected boundary before user input = 3549
slot update_slots: id  0 | task 0 | n_tokens = 3549, memory_seq_rm [3549, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 3558, batch.n_tokens = 9, progress = 0.998877
slot create_check: id  0 | task 0 | created context checkpoint 1 of 8 (pos_min = 3548, pos_max = 3548, n_tokens = 3549, size = 149.626 MiB)
slot update_slots: id  0 | task 0 | n_tokens = 3558, memory_seq_rm [3558, end)
slot init_sampler: id  0 | task 0 | init sampler, took 0.51 ms, tokens: text = 3562, total = 3562
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 3562, batch.n_tokens = 4
slot update_slots: id  0 | task 0 | skip checkpoint at 3558, expected boundary before user input = 3549
begin: ngram_mod occupancy = 3517/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
~llama_io_write_device: allocated 'CUDA0' buffer 52.992 MiB
~llama_io_write_device: allocated 'CUDA1' buffer 49.875 MiB
~llama_io_write_device: allocated 'CUDA2' buffer 46.758 MiB
slot print_timing: id  0 | task 0 |
prompt eval time =    2163.37 ms /  3562 tokens (    0.61 ms per token,  1646.51 tokens per second)
       eval time =    3204.93 ms /    73 tokens (   43.90 ms per token,    22.78 tokens per second)
      total time =    5368.29 ms /  3635 tokens
draft acceptance rate = 0.01562 (    1 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 1 71 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 1, dur(b,g,a) = 0.516, 0.120, 0.001 ms
slot      release: id  0 | task 0 | stop processing: n_tokens = 3634, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=15117, token_pos=3636
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.890 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 77 | processing task, is_child = 0
slot update_slots: id  0 | task 77 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 4085
slot update_slots: id  0 | task 77 | n_tokens = 3634, memory_seq_rm [3634, end)
slot update_slots: id  0 | task 77 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 3636
slot update_slots: id  0 | task 77 | prompt processing progress, n_tokens = 3636, batch.n_tokens = 2, progress = 0.890086
slot update_slots: id  0 | task 77 | skip checkpoint at 3634, expected boundary before user input = 3636
slot update_slots: id  0 | task 77 | n_tokens = 3636, memory_seq_rm [3636, end)
slot update_slots: id  0 | task 77 | prompt processing progress, n_tokens = 4081, batch.n_tokens = 445, progress = 0.999021
slot create_check: id  0 | task 77 | created context checkpoint 2 of 8 (pos_min = 3635, pos_max = 3635, n_tokens = 3636, size = 149.626 MiB)
slot update_slots: id  0 | task 77 | n_tokens = 4081, memory_seq_rm [4081, end)
slot init_sampler: id  0 | task 77 | init sampler, took 0.55 ms, tokens: text = 4085, total = 4085
slot update_slots: id  0 | task 77 | prompt processing done, n_tokens = 4085, batch.n_tokens = 4
slot update_slots: id  0 | task 77 | skip checkpoint at 4081, expected boundary before user input = 3636
begin: ngram_mod occupancy = 4028/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 77 |
prompt eval time =     495.71 ms /   451 tokens (    1.10 ms per token,   909.80 tokens per second)
       eval time =    3886.85 ms /    92 tokens (   42.25 ms per token,    23.67 tokens per second)
      total time =    4382.56 ms /   543 tokens
statistics ngram_mod: #calls(b,g,a) = 2 162 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 1, dur(b,g,a) = 1.079, 0.248, 0.001 ms
slot      release: id  0 | task 77 | stop processing: n_tokens = 4176, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=17241, token_pos=4178
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.248 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 172 | processing task, is_child = 0
slot update_slots: id  0 | task 172 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 16844
slot update_slots: id  0 | task 172 | n_tokens = 4176, memory_seq_rm [4176, end)
slot update_slots: id  0 | task 172 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 4178
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 4178, batch.n_tokens = 2, progress = 0.248041
slot update_slots: id  0 | task 172 | n_tokens = 4178, memory_seq_rm [4178, end)
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 12370, batch.n_tokens = 8192, progress = 0.734386
slot update_slots: id  0 | task 172 | n_tokens = 12370, memory_seq_rm [12370, end)
slot update_slots: id  0 | task 172 | 8192 tokens since last checkpoint at 3636, creating new checkpoint during processing at position 16328
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 16328, batch.n_tokens = 3958, progress = 0.969366
slot update_slots: id  0 | task 172 | skip checkpoint at 12370, expected boundary before user input = 4178
slot update_slots: id  0 | task 172 | n_tokens = 16328, memory_seq_rm [16328, end)
slot update_slots: id  0 | task 172 | prompt processing progress, n_tokens = 16840, batch.n_tokens = 512, progress = 0.999763
slot update_slots: id  0 | task 172 | skip checkpoint at 16328, expected boundary before user input = 4178
slot update_slots: id  0 | task 172 | n_tokens = 16840, memory_seq_rm [16840, end)
slot init_sampler: id  0 | task 172 | init sampler, took 2.74 ms, tokens: text = 16844, total = 16844
slot update_slots: id  0 | task 172 | prompt processing done, n_tokens = 16844, batch.n_tokens = 4
slot update_slots: id  0 | task 172 | skip checkpoint at 16840, expected boundary before user input = 4178
begin: ngram_mod occupancy = 14423/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 172 |
prompt eval time =    7495.20 ms / 12668 tokens (    0.59 ms per token,  1690.15 tokens per second)
       eval time =    3852.14 ms /    89 tokens (   43.28 ms per token,    23.10 tokens per second)
      total time =   11347.35 ms / 12757 tokens
statistics ngram_mod: #calls(b,g,a) = 3 250 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 1, dur(b,g,a) = 2.975, 0.394, 0.001 ms
slot      release: id  0 | task 172 | stop processing: n_tokens = 16932, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=57632, token_pos=16863
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.211 (> 0.100 thold), f_keep = 0.210
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 16932, total state size = 1208.199 MiB
srv          load:  - looking for better prompt, base f_keep = 0.210, sim = 0.211
srv        update:  - cache state: 1 prompts, 1507.452 MiB (limits: 8192.000 MiB, 262144 tokens, 262144 est)
srv        update:    - prompt 0x728630022c00:   16932 tokens, checkpoints:  2,  1507.452 MiB
srv  get_availabl: prompt cache update took 1136.85 ms
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 266 | processing task, is_child = 0
slot update_slots: id  0 | task 266 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 16875
slot update_slots: id  0 | task 266 | n_past = 3560, slot.prompt.tokens.size() = 16932, seq_id = 0, pos_min = 16931, n_swa = 0
slot update_slots: id  0 | task 266 | Checking checkpoint with [3635, 3635] against 3560...
slot update_slots: id  0 | task 266 | Checking checkpoint with [3548, 3548] against 3560...
slot update_slots: id  0 | task 266 | restored context checkpoint (pos_min = 3548, pos_max = 3548, n_tokens = 3549, n_past = 3549, size = 149.626 MiB)
slot update_slots: id  0 | task 266 | erased invalidated context checkpoint (pos_min = 3635, pos_max = 3635, n_tokens = 3636, n_swa = 0, pos_next = 3549, size = 149.626 MiB)
slot update_slots: id  0 | task 266 | n_tokens = 3549, memory_seq_rm [3549, end)
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 11741, batch.n_tokens = 8192, progress = 0.695763
slot update_slots: id  0 | task 266 | n_tokens = 11741, memory_seq_rm [11741, end)
slot update_slots: id  0 | task 266 | 8192 tokens since last checkpoint at 3549, creating new checkpoint during processing at position 16359
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 16359, batch.n_tokens = 4618, progress = 0.969422
slot update_slots: id  0 | task 266 | skip checkpoint at 11741, expected boundary before user input = 16863
slot update_slots: id  0 | task 266 | n_tokens = 16359, memory_seq_rm [16359, end)
slot update_slots: id  0 | task 266 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 16863
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 16863, batch.n_tokens = 504, progress = 0.999289
slot update_slots: id  0 | task 266 | skip checkpoint at 16359, expected boundary before user input = 16863
slot update_slots: id  0 | task 266 | n_tokens = 16863, memory_seq_rm [16863, end)
slot update_slots: id  0 | task 266 | prompt processing progress, n_tokens = 16871, batch.n_tokens = 8, progress = 0.999763
slot create_check: id  0 | task 266 | created context checkpoint 2 of 8 (pos_min = 16862, pos_max = 16862, n_tokens = 16863, size = 149.626 MiB)
slot update_slots: id  0 | task 266 | n_tokens = 16871, memory_seq_rm [16871, end)
slot init_sampler: id  0 | task 266 | init sampler, took 2.89 ms, tokens: text = 16875, total = 16875
slot update_slots: id  0 | task 266 | prompt processing done, n_tokens = 16875, batch.n_tokens = 4
slot update_slots: id  0 | task 266 | skip checkpoint at 16871, expected boundary before user input = 16863
begin: ngram_mod occupancy = 14593/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 266 |
prompt eval time =    8060.61 ms / 13326 tokens (    0.60 ms per token,  1653.22 tokens per second)
       eval time =    2922.18 ms /    71 tokens (   41.16 ms per token,    24.30 tokens per second)
      total time =   10982.80 ms / 13397 tokens
draft acceptance rate = 0.10938 (    7 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 4 313 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 8, dur(b,g,a) = 4.865, 0.524, 0.002 ms
slot      release: id  0 | task 266 | stop processing: n_tokens = 16945, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=57981, token_pos=16947
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.570 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 336 | processing task, is_child = 0
slot update_slots: id  0 | task 336 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 29730
slot update_slots: id  0 | task 336 | n_tokens = 16945, memory_seq_rm [16945, end)
slot update_slots: id  0 | task 336 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 16947
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 16947, batch.n_tokens = 2, progress = 0.570030
slot update_slots: id  0 | task 336 | n_tokens = 16947, memory_seq_rm [16947, end)
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 25139, batch.n_tokens = 8192, progress = 0.845577
slot update_slots: id  0 | task 336 | n_tokens = 25139, memory_seq_rm [25139, end)
slot update_slots: id  0 | task 336 | 8192 tokens since last checkpoint at 16863, creating new checkpoint during processing at position 29214
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 29214, batch.n_tokens = 4075, progress = 0.982644
slot update_slots: id  0 | task 336 | skip checkpoint at 25139, expected boundary before user input = 16947
slot update_slots: id  0 | task 336 | n_tokens = 29214, memory_seq_rm [29214, end)
slot update_slots: id  0 | task 336 | prompt processing progress, n_tokens = 29726, batch.n_tokens = 512, progress = 0.999865
slot update_slots: id  0 | task 336 | skip checkpoint at 29214, expected boundary before user input = 16947
slot update_slots: id  0 | task 336 | n_tokens = 29726, memory_seq_rm [29726, end)
slot init_sampler: id  0 | task 336 | init sampler, took 4.74 ms, tokens: text = 29730, total = 29730
slot update_slots: id  0 | task 336 | prompt processing done, n_tokens = 29730, batch.n_tokens = 4
slot update_slots: id  0 | task 336 | skip checkpoint at 29726, expected boundary before user input = 16947
begin: ngram_mod occupancy = 26770/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 336 |
prompt eval time =    9569.65 ms / 12785 tokens (    0.75 ms per token,  1336.00 tokens per second)
       eval time =   20765.44 ms /   454 tokens (   45.74 ms per token,    21.86 tokens per second)
      total time =   30335.09 ms / 13239 tokens
statistics ngram_mod: #calls(b,g,a) = 5 766 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 8, dur(b,g,a) = 8.193, 1.324, 0.002 ms
slot      release: id  0 | task 336 | stop processing: n_tokens = 30183, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=109526, token_pos=30104
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.560 (> 0.100 thold), f_keep = 0.559
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 795 | processing task, is_child = 0
slot update_slots: id  0 | task 795 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 30119
slot update_slots: id  0 | task 795 | n_past = 16873, slot.prompt.tokens.size() = 30183, seq_id = 0, pos_min = 30182, n_swa = 0
slot update_slots: id  0 | task 795 | Checking checkpoint with [16862, 16862] against 16873...
slot update_slots: id  0 | task 795 | restored context checkpoint (pos_min = 16862, pos_max = 16862, n_tokens = 16863, n_past = 16863, size = 149.626 MiB)
slot update_slots: id  0 | task 795 | n_tokens = 16863, memory_seq_rm [16863, end)
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 25055, batch.n_tokens = 8192, progress = 0.831867
slot update_slots: id  0 | task 795 | n_tokens = 25055, memory_seq_rm [25055, end)
slot update_slots: id  0 | task 795 | 8192 tokens since last checkpoint at 16863, creating new checkpoint during processing at position 29603
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 29603, batch.n_tokens = 4548, progress = 0.982868
slot update_slots: id  0 | task 795 | skip checkpoint at 25055, expected boundary before user input = 30104
slot update_slots: id  0 | task 795 | n_tokens = 29603, memory_seq_rm [29603, end)
slot update_slots: id  0 | task 795 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 30104
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 30104, batch.n_tokens = 501, progress = 0.999502
slot update_slots: id  0 | task 795 | skip checkpoint at 29603, expected boundary before user input = 30104
slot update_slots: id  0 | task 795 | n_tokens = 30104, memory_seq_rm [30104, end)
slot update_slots: id  0 | task 795 | prompt processing progress, n_tokens = 30115, batch.n_tokens = 11, progress = 0.999867
slot create_check: id  0 | task 795 | created context checkpoint 3 of 8 (pos_min = 30103, pos_max = 30103, n_tokens = 30104, size = 149.626 MiB)
slot update_slots: id  0 | task 795 | n_tokens = 30115, memory_seq_rm [30115, end)
slot init_sampler: id  0 | task 795 | init sampler, took 5.31 ms, tokens: text = 30119, total = 30119
slot update_slots: id  0 | task 795 | prompt processing done, n_tokens = 30119, batch.n_tokens = 4
slot update_slots: id  0 | task 795 | skip checkpoint at 30115, expected boundary before user input = 30104
begin: ngram_mod occupancy = 27278/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 795 |
prompt eval time =    8979.40 ms / 13256 tokens (    0.68 ms per token,  1476.27 tokens per second)
       eval time =    3708.26 ms /    88 tokens (   42.14 ms per token,    23.73 tokens per second)
      total time =   12687.65 ms / 13344 tokens
draft acceptance rate = 0.12500 (    8 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 6 845 3, #gen drafts = 3, #acc drafts = 3, #gen tokens = 192, #acc tokens = 16, dur(b,g,a) = 11.561, 4.845, 0.416 ms
slot      release: id  0 | task 795 | stop processing: n_tokens = 30206, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=109945, token_pos=30208
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.972 (> 0.100 thold), f_keep = 1.000
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 881 | processing task, is_child = 0
slot update_slots: id  0 | task 881 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 31092
slot update_slots: id  0 | task 881 | n_tokens = 30206, memory_seq_rm [30206, end)
slot update_slots: id  0 | task 881 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 30208
slot update_slots: id  0 | task 881 | prompt processing progress, n_tokens = 30208, batch.n_tokens = 2, progress = 0.971568
slot update_slots: id  0 | task 881 | n_tokens = 30208, memory_seq_rm [30208, end)
slot update_slots: id  0 | task 881 | prompt processing progress, n_tokens = 30576, batch.n_tokens = 368, progress = 0.983404
slot update_slots: id  0 | task 881 | n_tokens = 30576, memory_seq_rm [30576, end)
slot update_slots: id  0 | task 881 | prompt processing progress, n_tokens = 31088, batch.n_tokens = 512, progress = 0.999871
slot update_slots: id  0 | task 881 | skip checkpoint at 30576, expected boundary before user input = 30208
slot update_slots: id  0 | task 881 | n_tokens = 31088, memory_seq_rm [31088, end)
slot init_sampler: id  0 | task 881 | init sampler, took 4.97 ms, tokens: text = 31092, total = 31092
slot update_slots: id  0 | task 881 | prompt processing done, n_tokens = 31092, batch.n_tokens = 4
slot update_slots: id  0 | task 881 | skip checkpoint at 31088, expected boundary before user input = 30208
begin: ngram_mod occupancy = 27973/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 881 |
prompt eval time =     830.98 ms /   886 tokens (    0.94 ms per token,  1066.21 tokens per second)
       eval time =   36471.01 ms /   795 tokens (   45.88 ms per token,    21.80 tokens per second)
      total time =   37301.99 ms /  1681 tokens
statistics ngram_mod: #calls(b,g,a) = 7 1639 3, #gen drafts = 3, #acc drafts = 3, #gen tokens = 192, #acc tokens = 16, dur(b,g,a) = 15.050, 6.134, 0.416 ms
slot      release: id  0 | task 881 | stop processing: n_tokens = 31886, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=115431, token_pos=31825
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.946 (> 0.100 thold), f_keep = 0.945
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 1680 | processing task, is_child = 0
slot update_slots: id  0 | task 1680 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 31844
slot update_slots: id  0 | task 1680 | n_past = 30117, slot.prompt.tokens.size() = 31886, seq_id = 0, pos_min = 31885, n_swa = 0
slot update_slots: id  0 | task 1680 | Checking checkpoint with [30103, 30103] against 30117...
slot update_slots: id  0 | task 1680 | restored context checkpoint (pos_min = 30103, pos_max = 30103, n_tokens = 30104, n_past = 30104, size = 149.626 MiB)
slot update_slots: id  0 | task 1680 | n_tokens = 30104, memory_seq_rm [30104, end)
slot update_slots: id  0 | task 1680 | prompt processing progress, n_tokens = 31328, batch.n_tokens = 1224, progress = 0.983796
slot update_slots: id  0 | task 1680 | n_tokens = 31328, memory_seq_rm [31328, end)
slot update_slots: id  0 | task 1680 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 31825
slot update_slots: id  0 | task 1680 | prompt processing progress, n_tokens = 31825, batch.n_tokens = 497, progress = 0.999403
slot update_slots: id  0 | task 1680 | skip checkpoint at 31328, expected boundary before user input = 31825
slot update_slots: id  0 | task 1680 | n_tokens = 31825, memory_seq_rm [31825, end)
slot update_slots: id  0 | task 1680 | prompt processing progress, n_tokens = 31840, batch.n_tokens = 15, progress = 0.999874
slot create_check: id  0 | task 1680 | created context checkpoint 4 of 8 (pos_min = 31824, pos_max = 31824, n_tokens = 31825, size = 149.626 MiB)
slot update_slots: id  0 | task 1680 | n_tokens = 31840, memory_seq_rm [31840, end)
slot init_sampler: id  0 | task 1680 | init sampler, took 5.29 ms, tokens: text = 31844, total = 31844
slot update_slots: id  0 | task 1680 | prompt processing done, n_tokens = 31844, batch.n_tokens = 4
slot update_slots: id  0 | task 1680 | skip checkpoint at 31840, expected boundary before user input = 31825
begin: ngram_mod occupancy = 28823/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 1680 |
prompt eval time =    1590.88 ms /  1740 tokens (    0.91 ms per token,  1093.73 tokens per second)
       eval time =   13703.82 ms /   303 tokens (   45.23 ms per token,    22.11 tokens per second)
      total time =   15294.70 ms /  2043 tokens
draft acceptance rate = 0.01562 (    1 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 8 1940 4, #gen drafts = 4, #acc drafts = 4, #gen tokens = 256, #acc tokens = 17, dur(b,g,a) = 18.595, 6.556, 0.417 ms
slot      release: id  0 | task 1680 | stop processing: n_tokens = 32146, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=116352, token_pos=32112
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.991 (> 0.100 thold), f_keep = 0.991
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 1987 | processing task, is_child = 0
slot update_slots: id  0 | task 1987 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32124
slot update_slots: id  0 | task 1987 | n_past = 31842, slot.prompt.tokens.size() = 32146, seq_id = 0, pos_min = 32145, n_swa = 0
slot update_slots: id  0 | task 1987 | Checking checkpoint with [31824, 31824] against 31842...
slot update_slots: id  0 | task 1987 | restored context checkpoint (pos_min = 31824, pos_max = 31824, n_tokens = 31825, n_past = 31825, size = 149.626 MiB)
slot update_slots: id  0 | task 1987 | n_tokens = 31825, memory_seq_rm [31825, end)
slot update_slots: id  0 | task 1987 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 32112
slot update_slots: id  0 | task 1987 | prompt processing progress, n_tokens = 32112, batch.n_tokens = 287, progress = 0.999626
slot update_slots: id  0 | task 1987 | skip checkpoint at 31825, expected boundary before user input = 32112
slot update_slots: id  0 | task 1987 | n_tokens = 32112, memory_seq_rm [32112, end)
slot update_slots: id  0 | task 1987 | prompt processing progress, n_tokens = 32120, batch.n_tokens = 8, progress = 0.999875
slot create_check: id  0 | task 1987 | created context checkpoint 5 of 8 (pos_min = 32111, pos_max = 32111, n_tokens = 32112, size = 149.626 MiB)
slot update_slots: id  0 | task 1987 | n_tokens = 32120, memory_seq_rm [32120, end)
slot init_sampler: id  0 | task 1987 | init sampler, took 5.32 ms, tokens: text = 32124, total = 32124
slot update_slots: id  0 | task 1987 | prompt processing done, n_tokens = 32124, batch.n_tokens = 4
slot update_slots: id  0 | task 1987 | skip checkpoint at 32120, expected boundary before user input = 32112
begin: ngram_mod occupancy = 29157/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 1987 |
prompt eval time =     547.43 ms /   299 tokens (    1.83 ms per token,   546.18 tokens per second)
       eval time =    1478.92 ms /    34 tokens (   43.50 ms per token,    22.99 tokens per second)
      total time =    2026.36 ms /   333 tokens
statistics ngram_mod: #calls(b,g,a) = 9 1973 4, #gen drafts = 4, #acc drafts = 4, #gen tokens = 256, #acc tokens = 17, dur(b,g,a) = 22.176, 6.602, 0.417 ms
slot      release: id  0 | task 1987 | stop processing: n_tokens = 32157, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=116481, token_pos=32140
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 0.999
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 2024 | processing task, is_child = 0
slot update_slots: id  0 | task 2024 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32154
slot update_slots: id  0 | task 2024 | n_past = 32122, slot.prompt.tokens.size() = 32157, seq_id = 0, pos_min = 32156, n_swa = 0
slot update_slots: id  0 | task 2024 | Checking checkpoint with [32111, 32111] against 32122...
slot update_slots: id  0 | task 2024 | restored context checkpoint (pos_min = 32111, pos_max = 32111, n_tokens = 32112, n_past = 32112, size = 149.626 MiB)
slot update_slots: id  0 | task 2024 | n_tokens = 32112, memory_seq_rm [32112, end)
slot update_slots: id  0 | task 2024 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 32140
slot update_slots: id  0 | task 2024 | prompt processing progress, n_tokens = 32140, batch.n_tokens = 28, progress = 0.999565
slot update_slots: id  0 | task 2024 | skip checkpoint at 32112, expected boundary before user input = 32140
slot update_slots: id  0 | task 2024 | n_tokens = 32140, memory_seq_rm [32140, end)
slot update_slots: id  0 | task 2024 | prompt processing progress, n_tokens = 32150, batch.n_tokens = 10, progress = 0.999876
slot update_slots: id  0 | task 2024 | n_tokens = 32150, memory_seq_rm [32150, end)
slot init_sampler: id  0 | task 2024 | init sampler, took 5.37 ms, tokens: text = 32154, total = 32154
slot update_slots: id  0 | task 2024 | prompt processing done, n_tokens = 32154, batch.n_tokens = 4
slot update_slots: id  0 | task 2024 | skip checkpoint at 32150, expected boundary before user input = 32140
begin: ngram_mod occupancy = 29214/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: deactivated (natural end)
slot print_timing: id  0 | task 2024 |
prompt eval time =     195.19 ms /    42 tokens (    4.65 ms per token,   215.18 tokens per second)
       eval time =    1266.99 ms /    29 tokens (   43.69 ms per token,    22.89 tokens per second)
      total time =    1462.18 ms /    71 tokens
statistics ngram_mod: #calls(b,g,a) = 10 2001 4, #gen drafts = 4, #acc drafts = 4, #gen tokens = 256, #acc tokens = 17, dur(b,g,a) = 25.750, 6.643, 0.417 ms
slot      release: id  0 | task 2024 | stop processing: n_tokens = 32182, truncated = 0
srv  update_slots: all slots are idle
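
For readers following the log above, the "checkpoint before user input reached: ending prompt batch at ..." lines can be read as a clamp on the batch end. The sketch below is only an illustration of that idea, not the code from this PR: `next_batch_end` and its parameters are hypothetical names, and it ignores the server's other split rules (for example, the log shows the last 4 prompt tokens going into their own batch).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a function from llama.cpp.
// Returns the prompt position at which the next batch should stop.
//   n_past   - prompt tokens already processed
//   n_prompt - total tokens in the prompt
//   n_batch  - maximum tokens per batch (the -b value)
//   boundary - token position just before the latest user message, 0 if unknown
inline int32_t next_batch_end(int32_t n_past, int32_t n_prompt, int32_t n_batch, int32_t boundary) {
    assert(n_batch > 0);
    int32_t end = std::min(n_prompt, n_past + n_batch);
    // If the boundary lies strictly inside this batch, stop there so that
    // a checkpoint can be taken exactly before the user input.
    if (boundary > n_past && boundary < end) {
        end = boundary;
    }
    return end;
}
```

With the values from task 2024 (`n_past = 32112`, boundary `32140`) this returns 32140, which corresponds to the 28-token batch in the log. When the boundary is already behind `n_past`, the function falls back to the plain batch limit.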

ggml-gh-bot · 2026-05-11T01:35:51Z

Hi @jacekpoplawski, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

jacekpoplawski · 2026-05-11T02:00:14Z

CUDA_VISIBLE_DEVICES=0,1,2 ./bin/llama-server -m /mnt/models1/Google/gemma-4-31B-it-UD-Q8_K_XL.gguf --host 0.0.0.0 --ctx-checkpoints 8 -b 8192 --spec-type ngram-mod

Details

srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 3458
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 2942, batch.n_tokens = 2942, progress = 0.850781
slot update_slots: id  3 | task 0 | n_tokens = 2942, memory_seq_rm [2942, end)
slot update_slots: id  3 | task 0 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 3447
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 3447, batch.n_tokens = 505, progress = 0.996819
slot update_slots: id  3 | task 0 | skip checkpoint at 2942, expected boundary before user input = 3447
slot update_slots: id  3 | task 0 | n_tokens = 3447, memory_seq_rm [3447, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 3454, batch.n_tokens = 7, progress = 0.998843
slot create_check: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 0, pos_max = 3446, n_tokens = 3447, size = 2693.009 MiB)
slot update_slots: id  3 | task 0 | n_tokens = 3454, memory_seq_rm [3454, end)
slot init_sampler: id  3 | task 0 | init sampler, took 0.45 ms, tokens: text = 3458, total = 3458
slot update_slots: id  3 | task 0 | prompt processing done, n_tokens = 3458, batch.n_tokens = 4
slot update_slots: id  3 | task 0 | skip checkpoint at 3454, expected boundary before user input = 3447
begin: ngram_mod occupancy = 3409/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 0 |
prompt eval time =    4093.67 ms /  3458 tokens (    1.18 ms per token,   844.72 tokens per second)
       eval time =    3406.95 ms /    73 tokens (   46.67 ms per token,    21.43 tokens per second)
      total time =    7500.62 ms /  3531 tokens
draft acceptance rate = 0.06250 (    4 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 1 68 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 4, dur(b,g,a) = 0.493, 0.098, 0.001 ms
slot      release: id  3 | task 0 | stop processing: n_tokens = 3530, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.874 (> 0.100 thold), f_keep = 0.989
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 73 | processing task, is_child = 0
slot update_slots: id  3 | task 73 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 3993
slot update_slots: id  3 | task 73 | n_tokens = 3490, memory_seq_rm [3490, end)
slot update_slots: id  3 | task 73 | prompt processing progress, n_tokens = 3989, batch.n_tokens = 499, progress = 0.998998
slot update_slots: id  3 | task 73 | skip checkpoint at 3490, expected boundary before user input = 3447
slot update_slots: id  3 | task 73 | n_tokens = 3989, memory_seq_rm [3989, end)
slot init_sampler: id  3 | task 73 | init sampler, took 0.49 ms, tokens: text = 3993, total = 3993
slot update_slots: id  3 | task 73 | prompt processing done, n_tokens = 3993, batch.n_tokens = 4
slot update_slots: id  3 | task 73 | skip checkpoint at 3989, expected boundary before user input = 3447
begin: ngram_mod occupancy = 3945/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 73 |
prompt eval time =     486.93 ms /   503 tokens (    0.97 ms per token,  1033.01 tokens per second)
       eval time =    2965.44 ms /    62 tokens (   47.83 ms per token,    20.91 tokens per second)
      total time =    3452.37 ms /   565 tokens
statistics ngram_mod: #calls(b,g,a) = 2 129 1, #gen drafts = 1, #acc drafts = 1, #gen tokens = 64, #acc tokens = 4, dur(b,g,a) = 1.050, 0.165, 0.001 ms
slot      release: id  3 | task 73 | stop processing: n_tokens = 4054, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.980 (> 0.100 thold), f_keep = 0.991
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 137 | processing task, is_child = 0
slot update_slots: id  3 | task 137 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 4097
slot update_slots: id  3 | task 137 | n_tokens = 4017, memory_seq_rm [4017, end)
slot update_slots: id  3 | task 137 | prompt processing progress, n_tokens = 4093, batch.n_tokens = 76, progress = 0.999024
slot update_slots: id  3 | task 137 | skip checkpoint at 4017, expected boundary before user input = 3447
slot update_slots: id  3 | task 137 | n_tokens = 4093, memory_seq_rm [4093, end)
slot init_sampler: id  3 | task 137 | init sampler, took 0.50 ms, tokens: text = 4097, total = 4097
slot update_slots: id  3 | task 137 | prompt processing done, n_tokens = 4097, batch.n_tokens = 4
slot update_slots: id  3 | task 137 | skip checkpoint at 4093, expected boundary before user input = 3447
begin: ngram_mod occupancy = 4071/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 137 |
prompt eval time =     122.85 ms /    80 tokens (    1.54 ms per token,   651.21 tokens per second)
       eval time =    6455.73 ms /   132 tokens (   48.91 ms per token,    20.45 tokens per second)
      total time =    6578.58 ms /   212 tokens
draft acceptance rate = 0.01562 (    1 accepted /    64 generated)
statistics ngram_mod: #calls(b,g,a) = 3 259 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 1.622, 0.334, 0.002 ms
slot      release: id  3 | task 137 | stop processing: n_tokens = 4228, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=13864, token_pos=3447
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.216 (> 0.100 thold), f_keep = 0.991
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 270 | processing task, is_child = 0
slot update_slots: id  3 | task 270 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 19367
slot update_slots: id  3 | task 270 | n_tokens = 4190, memory_seq_rm [4190, end)
slot update_slots: id  3 | task 270 | prompt processing progress, n_tokens = 12382, batch.n_tokens = 8192, progress = 0.639335
slot update_slots: id  3 | task 270 | n_tokens = 12382, memory_seq_rm [12382, end)
slot update_slots: id  3 | task 270 | 8192 tokens since last checkpoint at 3447, creating new checkpoint during processing at position 18851
slot update_slots: id  3 | task 270 | prompt processing progress, n_tokens = 18851, batch.n_tokens = 6469, progress = 0.973357
slot update_slots: id  3 | task 270 | skip checkpoint at 12382, expected boundary before user input = 3447
slot update_slots: id  3 | task 270 | n_tokens = 18851, memory_seq_rm [18851, end)
slot update_slots: id  3 | task 270 | prompt processing progress, n_tokens = 19363, batch.n_tokens = 512, progress = 0.999793
slot update_slots: id  3 | task 270 | skip checkpoint at 18851, expected boundary before user input = 3447
slot update_slots: id  3 | task 270 | n_tokens = 19363, memory_seq_rm [19363, end)
slot init_sampler: id  3 | task 270 | init sampler, took 2.81 ms, tokens: text = 19367, total = 19367
slot update_slots: id  3 | task 270 | prompt processing done, n_tokens = 19367, batch.n_tokens = 4
slot update_slots: id  3 | task 270 | skip checkpoint at 19363, expected boundary before user input = 3447
begin: ngram_mod occupancy = 15256/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 270 |
prompt eval time =   11658.13 ms / 15177 tokens (    0.77 ms per token,  1301.84 tokens per second)
       eval time =    3870.86 ms /    77 tokens (   50.27 ms per token,    19.89 tokens per second)
      total time =   15528.99 ms / 15254 tokens
statistics ngram_mod: #calls(b,g,a) = 4 335 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 3.670, 0.441, 0.002 ms
slot      release: id  3 | task 270 | stop processing: n_tokens = 19443, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=56567, token_pos=19234
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.180 (> 0.100 thold), f_keep = 0.178
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 19443, total state size = 5119.261 MiB
srv          load:  - looking for better prompt, base f_keep = 0.178, sim = 0.180
srv        update:  - cache state: 1 prompts, 7812.270 MiB (limits: 8192.000 MiB, 262144 tokens, 262144 est)
srv        update:    - prompt 0x5816dafca4c0:   19443 tokens, checkpoints:  1,  7812.270 MiB
srv  get_availabl: prompt cache update took 5754.02 ms
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 351 | processing task, is_child = 0
slot update_slots: id  3 | task 351 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 19244
slot update_slots: id  3 | task 351 | n_past = 3458, slot.prompt.tokens.size() = 19443, seq_id = 3, pos_min = 14835, n_swa = 1024
slot update_slots: id  3 | task 351 | Checking checkpoint with [0, 3446] against 2434...
slot update_slots: id  3 | task 351 | restored context checkpoint (pos_min = 0, pos_max = 3446, n_tokens = 3447, n_past = 3446, size = 2693.009 MiB)
slot update_slots: id  3 | task 351 | n_tokens = 3446, memory_seq_rm [3446, end)
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 11638, batch.n_tokens = 8192, progress = 0.604760
slot update_slots: id  3 | task 351 | n_tokens = 11638, memory_seq_rm [11638, end)
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 18728, batch.n_tokens = 7090, progress = 0.973186
slot update_slots: id  3 | task 351 | n_tokens = 18728, memory_seq_rm [18728, end)
slot update_slots: id  3 | task 351 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 19234
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 19234, batch.n_tokens = 506, progress = 0.999480
slot update_slots: id  3 | task 351 | skip checkpoint at 18728, expected boundary before user input = 19234
slot update_slots: id  3 | task 351 | n_tokens = 19234, memory_seq_rm [19234, end)
slot update_slots: id  3 | task 351 | prompt processing progress, n_tokens = 19240, batch.n_tokens = 6, progress = 0.999792
slot create_check: id  3 | task 351 | created context checkpoint 2 of 8 (pos_min = 14626, pos_max = 19233, n_tokens = 19234, size = 3600.054 MiB)
slot update_slots: id  3 | task 351 | n_tokens = 19240, memory_seq_rm [19240, end)
slot init_sampler: id  3 | task 351 | init sampler, took 3.19 ms, tokens: text = 19244, total = 19244
slot update_slots: id  3 | task 351 | prompt processing done, n_tokens = 19244, batch.n_tokens = 4
slot update_slots: id  3 | task 351 | skip checkpoint at 19240, expected boundary before user input = 19234
begin: ngram_mod occupancy = 15428/4194304 (0.00)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 351 |
prompt eval time =   14532.32 ms / 15798 tokens (    0.92 ms per token,  1087.09 tokens per second)
       eval time =    1013.75 ms /    21 tokens (   48.27 ms per token,    20.72 tokens per second)
      total time =   15546.08 ms / 15819 tokens
statistics ngram_mod: #calls(b,g,a) = 5 355 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 5.697, 0.469, 0.002 ms
slot      release: id  3 | task 351 | stop processing: n_tokens = 19264, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=56567, token_pos=19234
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.590 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 377 | processing task, is_child = 0
slot update_slots: id  3 | task 377 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32621
slot update_slots: id  3 | task 377 | n_tokens = 19244, memory_seq_rm [19244, end)
slot update_slots: id  3 | task 377 | prompt processing progress, n_tokens = 27436, batch.n_tokens = 8192, progress = 0.841053
slot update_slots: id  3 | task 377 | n_tokens = 27436, memory_seq_rm [27436, end)
slot update_slots: id  3 | task 377 | 8192 tokens since last checkpoint at 19234, creating new checkpoint during processing at position 32105
slot update_slots: id  3 | task 377 | prompt processing progress, n_tokens = 32105, batch.n_tokens = 4669, progress = 0.984182
slot update_slots: id  3 | task 377 | skip checkpoint at 27436, expected boundary before user input = 19234
slot update_slots: id  3 | task 377 | n_tokens = 32105, memory_seq_rm [32105, end)
slot update_slots: id  3 | task 377 | prompt processing progress, n_tokens = 32617, batch.n_tokens = 512, progress = 0.999877
slot update_slots: id  3 | task 377 | skip checkpoint at 32105, expected boundary before user input = 19234
slot update_slots: id  3 | task 377 | n_tokens = 32617, memory_seq_rm [32617, end)
slot init_sampler: id  3 | task 377 | init sampler, took 4.94 ms, tokens: text = 32621, total = 32621
slot update_slots: id  3 | task 377 | prompt processing done, n_tokens = 32621, batch.n_tokens = 4
slot update_slots: id  3 | task 377 | skip checkpoint at 32617, expected boundary before user input = 19234
begin: ngram_mod occupancy = 28107/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 377 |
prompt eval time =   12903.45 ms / 13377 tokens (    0.96 ms per token,  1036.70 tokens per second)
       eval time =   33479.54 ms /   636 tokens (   52.64 ms per token,    19.00 tokens per second)
      total time =   46382.99 ms / 14013 tokens
statistics ngram_mod: #calls(b,g,a) = 6 990 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 9.193, 1.463, 0.002 ms
slot      release: id  3 | task 377 | stop processing: n_tokens = 33256, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=108386, token_pos=32971
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.989 (> 0.100 thold), f_keep = 0.981
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1017 | processing task, is_child = 0
slot update_slots: id  3 | task 1017 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 32982
slot update_slots: id  3 | task 1017 | n_tokens = 32621, memory_seq_rm [32621, end)
slot update_slots: id  3 | task 1017 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 32971
slot update_slots: id  3 | task 1017 | prompt processing progress, n_tokens = 32971, batch.n_tokens = 350, progress = 0.999667
slot update_slots: id  3 | task 1017 | skip checkpoint at 32621, expected boundary before user input = 32971
slot update_slots: id  3 | task 1017 | n_tokens = 32971, memory_seq_rm [32971, end)
slot update_slots: id  3 | task 1017 | prompt processing progress, n_tokens = 32978, batch.n_tokens = 7, progress = 0.999879
slot create_check: id  3 | task 1017 | created context checkpoint 3 of 8 (pos_min = 28648, pos_max = 32970, n_tokens = 32971, size = 3377.394 MiB)
slot update_slots: id  3 | task 1017 | n_tokens = 32978, memory_seq_rm [32978, end)
slot init_sampler: id  3 | task 1017 | init sampler, took 5.53 ms, tokens: text = 32982, total = 32982
slot update_slots: id  3 | task 1017 | prompt processing done, n_tokens = 32982, batch.n_tokens = 4
slot update_slots: id  3 | task 1017 | skip checkpoint at 32978, expected boundary before user input = 32971
begin: ngram_mod occupancy = 28774/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 1017 |
prompt eval time =    2789.51 ms /   361 tokens (    7.73 ms per token,   129.41 tokens per second)
       eval time =    8041.81 ms /   155 tokens (   51.88 ms per token,    19.27 tokens per second)
      total time =   10831.32 ms /   516 tokens
statistics ngram_mod: #calls(b,g,a) = 7 1144 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 12.740, 1.720, 0.002 ms
slot      release: id  3 | task 1017 | stop processing: n_tokens = 33136, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=108386, token_pos=32971
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1175 | processing task, is_child = 0
slot update_slots: id  3 | task 1175 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 33193
slot update_slots: id  3 | task 1175 | n_tokens = 33108, memory_seq_rm [33108, end)
slot update_slots: id  3 | task 1175 | prompt processing progress, n_tokens = 33189, batch.n_tokens = 81, progress = 0.999879
slot update_slots: id  3 | task 1175 | skip checkpoint at 33108, expected boundary before user input = 32971
slot update_slots: id  3 | task 1175 | n_tokens = 33189, memory_seq_rm [33189, end)
slot init_sampler: id  3 | task 1175 | init sampler, took 5.50 ms, tokens: text = 33193, total = 33193
slot update_slots: id  3 | task 1175 | prompt processing done, n_tokens = 33193, batch.n_tokens = 4
slot update_slots: id  3 | task 1175 | skip checkpoint at 33189, expected boundary before user input = 32971
begin: ngram_mod occupancy = 29007/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 1175 |
prompt eval time =     172.46 ms /    85 tokens (    2.03 ms per token,   492.86 tokens per second)
       eval time =    3546.63 ms /    69 tokens (   51.40 ms per token,    19.46 tokens per second)
      total time =    3719.09 ms /   154 tokens
statistics ngram_mod: #calls(b,g,a) = 8 1212 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 16.295, 1.829, 0.002 ms
slot      release: id  3 | task 1175 | stop processing: n_tokens = 33261, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=108386, token_pos=32971
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.972 (> 0.100 thold), f_keep = 0.999
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1246 | processing task, is_child = 0
slot update_slots: id  3 | task 1246 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 34193
slot update_slots: id  3 | task 1246 | n_tokens = 33230, memory_seq_rm [33230, end)
slot update_slots: id  3 | task 1246 | prompt processing progress, n_tokens = 33677, batch.n_tokens = 447, progress = 0.984909
slot update_slots: id  3 | task 1246 | n_tokens = 33677, memory_seq_rm [33677, end)
slot update_slots: id  3 | task 1246 | prompt processing progress, n_tokens = 34189, batch.n_tokens = 512, progress = 0.999883
slot update_slots: id  3 | task 1246 | skip checkpoint at 33677, expected boundary before user input = 32971
slot update_slots: id  3 | task 1246 | n_tokens = 34189, memory_seq_rm [34189, end)
slot init_sampler: id  3 | task 1246 | init sampler, took 5.74 ms, tokens: text = 34193, total = 34193
slot update_slots: id  3 | task 1246 | prompt processing done, n_tokens = 34193, batch.n_tokens = 4
slot update_slots: id  3 | task 1246 | skip checkpoint at 34189, expected boundary before user input = 32971
begin: ngram_mod occupancy = 30013/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
slot print_timing: id  3 | task 1246 |
prompt eval time =    1217.33 ms /   963 tokens (    1.26 ms per token,   791.08 tokens per second)
       eval time =   23152.00 ms /   443 tokens (   52.26 ms per token,    19.13 tokens per second)
      total time =   24369.32 ms /  1406 tokens
statistics ngram_mod: #calls(b,g,a) = 9 1654 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 19.960, 2.572, 0.002 ms
slot      release: id  3 | task 1246 | stop processing: n_tokens = 34635, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-gemma4
srv  params_from_: message_spans: last user boundary: byte_pos=113544, token_pos=34470
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.956 (> 0.100 thold), f_keep = 0.952
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  3 | task 1692 | processing task, is_child = 0
slot update_slots: id  3 | task 1692 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 34487
slot update_slots: id  3 | task 1692 | n_tokens = 32982, memory_seq_rm [32982, end)
slot update_slots: id  3 | task 1692 | prompt processing progress, n_tokens = 33971, batch.n_tokens = 989, progress = 0.985038
slot update_slots: id  3 | task 1692 | n_tokens = 33971, memory_seq_rm [33971, end)
slot update_slots: id  3 | task 1692 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 34470
slot update_slots: id  3 | task 1692 | prompt processing progress, n_tokens = 34470, batch.n_tokens = 499, progress = 0.999507
slot update_slots: id  3 | task 1692 | skip checkpoint at 33971, expected boundary before user input = 34470
slot update_slots: id  3 | task 1692 | n_tokens = 34470, memory_seq_rm [34470, end)
slot update_slots: id  3 | task 1692 | prompt processing progress, n_tokens = 34483, batch.n_tokens = 13, progress = 0.999884
slot create_check: id  3 | task 1692 | created context checkpoint 4 of 8 (pos_min = 30027, pos_max = 34469, n_tokens = 34470, size = 3471.146 MiB)
slot update_slots: id  3 | task 1692 | n_tokens = 34483, memory_seq_rm [34483, end)
slot init_sampler: id  3 | task 1692 | init sampler, took 5.74 ms, tokens: text = 34487, total = 34487
slot update_slots: id  3 | task 1692 | prompt processing done, n_tokens = 34487, batch.n_tokens = 4
slot update_slots: id  3 | task 1692 | skip checkpoint at 34483, expected boundary before user input = 34470
begin: ngram_mod occupancy = 30517/4194304 (0.01)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.0.196 200
reasoning-budget: activated, budget=2147483647 tokens
reasoning-budget: deactivated (natural end)
slot print_timing: id  3 | task 1692 |
prompt eval time =    4045.72 ms /  1505 tokens (    2.69 ms per token,   372.00 tokens per second)
       eval time =   36678.33 ms /   675 tokens (   54.34 ms per token,    18.40 tokens per second)
      total time =   40724.04 ms /  2180 tokens
statistics ngram_mod: #calls(b,g,a) = 10 2328 2, #gen drafts = 2, #acc drafts = 2, #gen tokens = 128, #acc tokens = 5, dur(b,g,a) = 23.688, 3.734, 0.002 ms
slot      release: id  3 | task 1692 | stop processing: n_tokens = 35161, truncated = 0
srv  update_slots: all slots are idle

ggerganov · 2026-05-11T18:34:02Z

Yes, that seems in a good direction. Have you done testing that it works as expected?

pwilkin · 2026-05-11T18:56:40Z

This needs autoparser dedicated support for split-marker detection; currently, this will assume that all autoparser models use the ChatML markers (<|im_start|> etc.), which is incorrect.

I'll try to submit the marker detection code ASAP.
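
To illustrate the concern, here is a minimal sketch of why a hard-coded marker does not generalize. `last_user_marker_pos` is a hypothetical helper, not the autoparser API, and the Gemma-style `<start_of_turn>` string is only an illustrative example of a non-ChatML marker.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the last occurrence of the user-turn
// marker in the rendered prompt, or std::string::npos if it never occurs.
inline std::size_t last_user_marker_pos(const std::string & prompt, const std::string & user_marker) {
    return prompt.rfind(user_marker);
}
```

Searching a ChatML-rendered prompt for `<|im_start|>user` finds the turn, but the same search on a prompt rendered with different markers returns `npos`, so the marker would have to come from per-template detection rather than from a constant.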

aldehir · 2026-05-11T21:58:14Z


+    const auto message_spans = json_value(data, "message_spans", json::array());
+    if (message_spans.is_array()) {
+        int32_t last_user_pos = -1;


You can probably use 0 as the sentinel value here, since a checkpoint at pos 0 isn't useful. Should help clean up the other logic too.
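
A rough sketch of what that could look like, using a simplified stand-in for the JSON data. The `message_span` struct and `last_user_pos` function are hypothetical and only meant to show the sentinel idea; spans are assumed to be listed in prompt order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one entry of message_spans.
struct message_span {
    std::string role;
    int32_t     begin; // byte offset where the message starts in the rendered prompt
};

// Returns the start of the last user message, or 0 when there is none.
// 0 doubles as "not found": a checkpoint at position 0 is not useful anyway,
// so callers only need a single `pos > 0` check.
inline int32_t last_user_pos(const std::vector<message_span> & spans) {
    int32_t pos = 0;
    for (const auto & s : spans) {
        if (s.role == "user" && s.begin > 0) {
            pos = s.begin;
        }
    }
    return pos;
}
```

The caller then no longer has to distinguish `-1` from a valid offset before casting to `size_t`.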

aldehir · 2026-05-11T21:59:47Z

+
+            if ((size_t) last_user_pos <= prompt.size()) {
+                const std::string prefix = prompt.substr(0, (size_t) last_user_pos);
+                const auto prefix_tokens = common_tokenize(vocab, prefix, true, true);


Just a guess, but this will probably create incorrect checkpoints for multimodal models with at least one image in the prompt.

Yes, you are right, this breaks after the first image.

now it should be ok
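
For context, a sketch of why tokenizing the text prefix alone goes wrong once an image is present, and one possible way to map positions chunk by chunk. This is not the code in this PR: `prompt_chunk`, `byte_pos_to_token_pos` and the word-counting `count_words` tokenizer are all hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical prompt chunk: plain text or a media item (e.g. an image).
struct prompt_chunk {
    bool        is_media;
    std::string text;     // template text, or the placeholder marker for media
    std::size_t n_tokens; // positions this chunk occupies in the server prompt
};

// Toy tokenizer used only for this example: one token per whitespace-separated word.
inline std::size_t count_words(const std::string & s) {
    std::istringstream in(s);
    std::string w;
    std::size_t n = 0;
    while (in >> w) { n++; }
    return n;
}

// Maps a byte offset in the rendered template to a token position by walking
// the chunks, so that media chunks contribute their real token count.
inline std::size_t byte_pos_to_token_pos(const std::vector<prompt_chunk> & chunks,
                                         std::size_t byte_pos,
                                         const std::function<std::size_t(const std::string &)> & count_tokens) {
    std::size_t bytes = 0;
    std::size_t tokens = 0;
    for (const auto & c : chunks) {
        if (byte_pos >= bytes + c.text.size()) {
            // the whole chunk lies before the boundary
            bytes  += c.text.size();
            tokens += c.n_tokens;
            continue;
        }
        if (!c.is_media) {
            // boundary falls inside this text chunk: count only the prefix
            tokens += count_tokens(c.text.substr(0, byte_pos - bytes));
        }
        // a boundary inside a media placeholder snaps to the start of the media chunk
        return tokens;
    }
    return tokens;
}
```

In this toy setup, a prompt of `"describe "`, an image occupying 256 positions, and `" then answer"` maps byte offset 14 (right after the image placeholder) to token position 257, whereas tokenizing the 14-byte text prefix alone would give 2. A real tokenizer may also split differently at a cut point, which this sketch does not address.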

jacekpoplawski · 2026-05-12T02:47:28Z

Yes, that seems in a good direction. Have you done testing that it works as expected?

It is stable for my use case: pi, Qwen 3.6 27B, 200k ctx, 24 checkpoints

With 8 checkpoints I was able to reproduce the "forcing full prompt re-processing" message, but with 24 I can work for hours without issues.

As @aldehir pointed out, this does not work correctly with multimodal prompts. I committed a fallback to the old mechanism for that case.

Should I add a switch to enable this new mechanism as an option, or should I try to support multimodal prompts as well?

I understand that the impact of this change is significant, but the benefits are also significant: agentic coding is much more responsive now.

corrm · 2026-05-13T18:41:05Z

Tested, model used: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF UD-Q4_K_XL.

--model /mnt/data/models/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf --ctx-size 136000 --n-gpu-layers 64 --threads 9 --parallel 1 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1 --presence-penalty 0 --no-webui --mlock --context-shift

The message "forcing full prompt re-processing due to lack of cache data" never shows up in the logs, and I now get a cache hit almost every time.

There was one cache miss that happened only once, and I can't reproduce it. I tried for an hour without being able to hit it again.

Great work! You saved my time and my electricity bill.

Edit:
The cache miss happens when all slots are idle:

srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-native
srv  params_from_: message_spans: last user boundary: byte_pos=382715, token_pos=104037
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.873 (> 0.100 thold), f_keep = 0.870
reasoning-budget: activated, budget=2147483647 tokens
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 24756 | processing task, is_child = 0
slot update_slots: id  0 | task 24756 | new prompt, n_ctx_slot = 136192, n_keep = 0, task.n_tokens = 104067
slot update_slots: id  0 | task 24756 | n_past = 90900, slot.prompt.tokens.size() = 104430, seq_id = 0, pos_min = 104429, n_swa = 0
slot update_slots: id  0 | task 24756 | Checking checkpoint with [100075, 100075] against 90900...
slot update_slots: id  0 | task 24756 | Checking checkpoint with [90865, 90865] against 90900...
slot update_slots: id  0 | task 24756 | restored context checkpoint (pos_min = 90865, pos_max = 90865, n_tokens = 90866, n_past = 90866, size = 149.626 MiB)
slot update_slots: id  0 | task 24756 | erased invalidated context checkpoint (pos_min = 100075, pos_max = 100075, n_tokens = 100076, n_swa = 0, pos_next = 90866, size = 149.626 MiB)
slot update_slots: id  0 | task 24756 | n_tokens = 90866, memory_seq_rm [90866, end)
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 92914, batch.n_tokens = 2048, progress = 0.892829
slot update_slots: id  0 | task 24756 | n_tokens = 92914, memory_seq_rm [92914, end)
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 94962, batch.n_tokens = 2048, progress = 0.912508
slot update_slots: id  0 | task 24756 | n_tokens = 94962, memory_seq_rm [94962, end)
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 97010, batch.n_tokens = 2048, progress = 0.932188
slot update_slots: id  0 | task 24756 | n_tokens = 97010, memory_seq_rm [97010, end)
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 99058, batch.n_tokens = 2048, progress = 0.951868
slot update_slots: id  0 | task 24756 | n_tokens = 99058, memory_seq_rm [99058, end)
slot update_slots: id  0 | task 24756 | 8192 tokens since last checkpoint at 90866, creating new checkpoint during processing at position 101106
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 101106, batch.n_tokens = 2048, progress = 0.971547
slot update_slots: id  0 | task 24756 | skip checkpoint at 99058, expected boundary before user input = 104037
slot update_slots: id  0 | task 24756 | n_tokens = 101106, memory_seq_rm [101106, end)
slot update_slots: id  0 | task 24756 | 8192 tokens since last checkpoint at 90866, creating new checkpoint during processing at position 103154
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 103154, batch.n_tokens = 2048, progress = 0.991227
slot update_slots: id  0 | task 24756 | skip checkpoint at 101106, expected boundary before user input = 104037
slot update_slots: id  0 | task 24756 | n_tokens = 103154, memory_seq_rm [103154, end)
slot update_slots: id  0 | task 24756 | 8192 tokens since last checkpoint at 90866, creating new checkpoint during processing at position 103551
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 103551, batch.n_tokens = 397, progress = 0.995042
slot update_slots: id  0 | task 24756 | skip checkpoint at 103154, expected boundary before user input = 104037
slot update_slots: id  0 | task 24756 | n_tokens = 103551, memory_seq_rm [103551, end)
slot update_slots: id  0 | task 24756 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 104037
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 104037, batch.n_tokens = 486, progress = 0.999712
slot update_slots: id  0 | task 24756 | skip checkpoint at 103551, expected boundary before user input = 104037
slot update_slots: id  0 | task 24756 | n_tokens = 104037, memory_seq_rm [104037, end)
slot update_slots: id  0 | task 24756 | prompt processing progress, n_tokens = 104063, batch.n_tokens = 26, progress = 0.999962
slot create_check: id  0 | task 24756 | created context checkpoint 7 of 32 (pos_min = 104036, pos_max = 104036, n_tokens = 104037, size = 149.626 MiB)
slot update_slots: id  0 | task 24756 | n_tokens = 104063, memory_seq_rm [104063, end)
slot init_sampler: id  0 | task 24756 | init sampler, took 11.12 ms, tokens: text = 104067, total = 104067
slot update_slots: id  0 | task 24756 | prompt processing done, n_tokens = 104067, batch.n_tokens = 4
slot update_slots: id  0 | task 24756 | skip checkpoint at 104063, expected boundary before user input = 104037
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
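
For readers following the log: the batch that ends at 104037 has 486 tokens instead of 2048, which is consistent with capping the next batch at the expected boundary. A minimal sketch of that capping, assuming a hypothetical helper `next_batch_size` (this is not the server's code, and it does not try to explain the 397-token batch at 103154):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not llama.cpp code: how many prompt tokens to put in
// the next batch so that a batch never crosses `boundary`.
//   n_done   - prompt tokens already processed
//   n_total  - total prompt tokens
//   n_batch  - maximum tokens per batch
//   boundary - token position before the latest user message (0 = unknown)
int32_t next_batch_size(int32_t n_done, int32_t n_total, int32_t n_batch, int32_t boundary) {
    assert(n_done >= 0 && n_done <= n_total && n_batch > 0);
    int32_t n = std::min(n_batch, n_total - n_done);
    if (boundary > 0 && n_done < boundary) {
        // stop exactly at the boundary so a checkpoint can be taken there
        n = std::min(n, boundary - n_done);
    }
    return n;
}
```

With the values from the log, `next_batch_size(103551, 104067, 2048, 104037)` gives 486, the size of the batch that lands on the boundary.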

ggerganov · 2026-05-13T19:44:54Z

+                        // stop the prompt batch exactly before the latest user input, so a checkpoint
+                        // can be created at the conversation boundary
+                        if (checkpoint_before_last_user_token > 0 &&
+                            slot.prompt.n_tokens() == checkpoint_before_last_user_token) {


Just for testing, can you check that this works also with images:

Suggested change

-                            slot.prompt.n_tokens() == checkpoint_before_last_user_token) {
+                            slot.prompt.get_text_tokens().size() == checkpoint_before_last_user_token) {

And also the same change applied below on line 2752

jacekpoplawski (reply, May 15, 2026): I tried using slot.prompt.tokens.get_text_tokens().size(), but it didn’t help. I am now testing an approach that iterates over the full token sequence while skipping LLAMA_TOKEN_NULL, and it seems to work.
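
A rough sketch of what "iterating over the full token sequence while skipping LLAMA_TOKEN_NULL" could look like. The helper name and the container type are hypothetical, and the sketch assumes that image/audio placeholder slots are stored as -1 (the value `llama.h` uses for `LLAMA_TOKEN_NULL`, as far as I know); the actual change in the PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: multimodal placeholder slots are stored as -1.
constexpr int32_t TOKEN_NULL_SKETCH = -1;

// Hypothetical helper: return the index in the full token sequence at which
// exactly `n_text` text (non-null) tokens have been seen. Returns
// tokens.size() when the boundary falls at the end of the sequence.
size_t full_pos_for_text_count(const std::vector<int32_t> & tokens, size_t n_text) {
    assert(n_text <= tokens.size());
    size_t seen = 0;
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (seen == n_text) {
            return i;
        }
        if (tokens[i] != TOKEN_NULL_SKETCH) {
            ++seen;
        }
    }
    return tokens.size();
}
```

For a sequence such as `{10, 11, -1, -1, 12, 13}`, a text-token count of 3 maps to index 5 in the full sequence, so a boundary computed on text tokens alone is shifted by the two placeholder slots.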

Arii02 · 2026-05-14T22:23:50Z

Just pulled that PR. When using Pi or OpenCode I still get forcing full prompt re-processing due to lack of cache data. I'm running on a Mac.

jacekpoplawski · 2026-05-15T04:18:11Z

> Just pulled that PR. When using Pi or OpenCode I still get forcing full prompt re-processing due to lack of cache data. Im running a Mac.

Could you say a bit more and show some logs? I tested this fix for many hours (Qwen 3.6 27B), and forcing full prompt... appeared only at the beginning, when there was no cache yet, or when pi was doing some long task with over 30 reads/edits
What model do you use?

jacekpoplawski · 2026-05-16T05:32:27Z

@ggerganov Multimodal prompts are now supported. I will continue working on this, because the checkpoint is currently created only after the image is read for the second time, not the first time.

If you are happy with the direction of my changes, I’d like to add two arguments: one to set min_checkpoint_tokens (it’s probably too small now), and one to enable or disable the whole message_spans mechanism

Arii02 · 2026-05-16T13:21:17Z

> > Just pulled that PR. When using Pi or OpenCode I still get forcing full prompt re-processing due to lack of cache data. Im running a Mac.
>
> Could you say a bit more and show some logs? I tested this fix for many hours (Qwen 3.6 27B), and forcing full prompt... appeared only at the beginning, when there was no cache yet, or when pi was doing some long task with over 30 reads/edits What model do you use?

I was using qwen3.5 122b. Sadly, I can't provide more logs right now.

ggerganov · 2026-05-16T13:30:35Z

@jacekpoplawski Is the message_spans argument needed? I think if the approach works correctly, it would never be better to disable it, so we can keep it always on.

aagit · 2026-05-26T10:31:42Z

Hi,

1. The On-Demand Checkpoint

For speculative KV-cache preloading, an on-demand checkpoint triggered by an abrupt TCP disconnect would probably fix it because you can get the information of "where" to do it while it's still running.

Race Condition Concern: I currently use id_slot=0 in the JSON payload to handle race conditions where a disconnect is slower than a new connection. Occasionally the llama server is slow at closing the connection, but the id_slot workaround is stable.
Limitation: While an on-demand checkpoint fixes the preloading logic, it does not solve the truncation problem described below.

2. The LRU Strategy & Truncation (The Core Issue)

My workflow involves a Context LRU system where rg-edit frequently modifies files within the context.

The Problem: The KV-cache is truncated by non-linear file edits, and the exact truncation point is unpredictable because it depends entirely on how the LLM rewrites the rg-edit buffers. Even if I were editing the file myself, I could not know in advance which byte I would modify.
Why On-Demand Fails Here: Since the next truncation point is only known after the previous KV-cache computations have completed, an on-demand checkpoint cannot be triggered before the checkpoint data is lost.
The Solution: I need periodic checkpoints (e.g., every N tokens) to ensure that, when truncation occurs, a nearby KV-cache state has been preserved and the context does not have to be recomputed from scratch.

How the LRU works:

Files that are actively edited are pushed to the back (bottom) of the context.
Readonly/unchanged files accumulate at the top.
This ensures the most relevant, actively worked-on files are at the bottom (where the transformer has the easiest time referencing them), also improving accuracy.
This strategy minimizes the frequency of full recomputations, but it still requires periodic checkpoints to handle the inevitable truncations, which happen frequently towards the end of the context (they can happen at offset 30k if I edit a rarely edited file, or more often around 100k).

3. Current Implementation & Optimization

Currently, I maintain 64 checkpoints with a frequency of 1 checkpoint every 2k tokens.

If the context exceeds 128k tokens and truncation occurs near context offset 0, a full recompute is required.
However, thanks to the LRU strategy, this scenario is statistically rare over time and can be prevented, if needed, simply by tweaking the number of checkpoints or the max interval.
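
To make the 64 x 2k = 128k arithmetic concrete, here is a small model of fixed-interval checkpoints. It is only a sketch under stated assumptions: checkpoints sit at exact multiples of the interval, only the newest `max_ckpts` are kept, and `resume_pos` is a hypothetical name, not a server function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model: checkpoint k sits at position k * interval
// (k = 1, 2, ...), and only the newest `max_ckpts` checkpoints are kept.
//
// Returns the position processing could resume from after the context is
// truncated at `trunc_pos`: the newest kept checkpoint at or before
// trunc_pos, or 0 if that checkpoint was already evicted (full recompute).
int64_t resume_pos(int64_t n_ctx, int64_t trunc_pos, int64_t interval, int64_t max_ckpts) {
    assert(interval > 0 && max_ckpts > 0 && trunc_pos >= 0 && trunc_pos <= n_ctx);
    const int64_t n_created  = n_ctx / interval;                                // checkpoints created so far
    const int64_t first_kept = std::max<int64_t>(1, n_created - max_ckpts + 1); // oldest surviving index
    const int64_t candidate  = trunc_pos / interval;                            // newest index <= trunc_pos
    return candidate >= first_kept ? candidate * interval : 0;
}
```

With an interval of 2000 and 64 checkpoints, a 100k context truncated at 30500 resumes from 30000, while a 140k context truncated at 5000 falls back to 0 because the early checkpoints have been evicted.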

4. Potential Optimization: Exponential Checkpoint Frequency

A more complex strategy to further optimize this would be possible: Exponential Checkpoint Frequency.

Concept: Instead of a fixed interval, the checkpoint frequency increases exponentially as the context grows, ensuring a maximum distance of 2k tokens near the end of the context.
Trade-off: This is more complex to implement and maintain than a simple max_N_tokens interval. The benefit is primarily noticeable at the start of the context (where truncation is not happening frequently due to the LRU ordering).
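
One possible reading of the exponential idea, as a sketch only: walk backwards from the end of the context and double the gap at each step, so spacing is tightest at the tail. The function name and the doubling factor are assumptions made for illustration; other schedules (for example several 2k steps before the gap starts growing) would fit the description equally well.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical placement: walk backwards from the end of the context,
// doubling the gap at every step, so the spacing is `min_gap` at the tail and
// grows toward the start. Returns positions in ascending order.
std::vector<int64_t> exponential_checkpoints(int64_t n_tokens, int64_t min_gap) {
    assert(n_tokens >= 0 && min_gap > 0);
    std::vector<int64_t> out;
    int64_t pos = n_tokens;
    for (int64_t gap = min_gap; pos - gap > 0; gap *= 2) {
        pos -= gap;
        out.push_back(pos);
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

For a 128000-token context with a 2000-token minimum gap this yields six positions (2000, 66000, 98000, 114000, 122000, 126000), which shows the trade-off: far fewer checkpoints than the fixed scheme, but large gaps in the early context.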

Conclusion:
Given the complexity of implementing exponential frequency, I believe the current fixed-interval approach (1 checkpoint every 2k tokens, capped at 64 checkpoints) is a good balance. The LRU strategy in the client already mitigates the worst-case scenarios.

Before --checkpoint-every-n-tokens existed, I had already implemented the same idea in a PR (checkpoint at every logical batch size), and back then my PR fixed hybrids for opencode too, so I guess opencode has the same requirements under the hood. It's just less apparent, because with an agent you can't immediately tell that things aren't working right when it starts to recompute from scratch.

Best regards.

ZacharyReis · 2026-05-26T16:24:01Z

> So the functionality you actually need is a "system prompt checkpoint".

Exactly — --system-prompt-checkpoint would solve our case completely. Our system prompt is static across all turns (~10.8K tokens), so a single checkpoint there would let us restore that prefix and only re-evaluate the dynamic tail (~4-6K tokens of context + utterance). That would bring us back to the 2-4s prompt eval we had before this PR.

For what it's worth, the token position is stable and known ahead of time in chat assistant use cases — the system prompt doesn't change between turns, so the flag value can be set once at server launch.

I'd also echo @aagit's point that periodic checkpoints (--checkpoint-every-n-tokens) alongside the new message-boundary logic would cover a broader set of use cases. The two approaches aren't mutually exclusive — message-boundary checkpoints optimize the common chat case, periodic checkpoints handle the unpredictable-truncation case, and a system-prompt checkpoint handles the stable-prefix case.

Looking forward to the follow-up PR.

kripper · 2026-05-26T18:30:20Z

Is this supposed to somehow help #22907 (comment) ?

aagit · 2026-05-26T19:35:11Z

> Is this supposed to somehow help #22907 (comment) ?

I see the checkpoint on TCP disconnect was already proposed before for the timeout case.

Unless there are other cons I'm missing, such an option to checkpoint on TCP disconnect would save me, on average, about half a second to one second per speculative background context preload interruption. So I'd certainly like that too.

It's a smaller issue than lack of --checkpoint-max-step though.

* common : add common_chat_split_by_role
* cont : fix spans to reach end of message
* server: fix checkpoints creation
  - extract message_spans from chat templates
  - find the prompt token position before the latest user message
  - split prompt batching at that position
  - create a context checkpoint before the latest user input
  - avoid periodic mid-prompt checkpoints when that position is known
  - handle multimodal prompts when mapping text/template positions to server prompt tokens
  - add --checkpoint-min-step to control minimum spacing between checkpoints
* cont : clean-up
* Support autoparser detection for message barriers
* server: fix message span delimiter and update docs

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>

wollastonzhu · 2026-05-28T01:34:59Z

My English is limited and my native language is Chinese, so the following was translated with the help of Qwen.

Feature Request: Support mid-prompt periodic checkpoints for long-context retry resilience

Context

First, thank you for the great work on #22929! The message-boundary checkpoint logic is a smart optimization for typical chat workflows.

However, I'm encountering a limitation in a production scenario with very long contexts (150k+ tokens):

Current Behavior

Processing a 150k+ token prompt takes >90s, hitting my frontend/API timeout
The client automatically retries the identical request
With the current message-boundary-only checkpoint strategy, the retry starts from scratch (no mid-prompt checkpoint to resume from)
Result: infinite timeout loop for long prompts

Desired Behavior

Allow periodic mid-prompt checkpoints (e.g., every N tokens) even when message boundaries are known
On retry of the same request, the server could restore from the latest checkpoint and continue processing, avoiding full re-computation

Use Case Details

Setup: llama-server with -c 200000, long system prompts + repo context + user query
Timeout: 90s gateway/proxy limit (common in cloud deployments)
Retry pattern: Client sends identical request on timeout (idempotent by design)
Expected gain: With checkpoints every ~10-20k tokens, a retry after 90s could resume from ~80% progress instead of 0%
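
The retry loop described above can be modelled in a few lines. This is a sketch with assumed numbers, not a measurement: `attempts_needed` is a hypothetical helper, the throughput is taken as constant, and progress is assumed to survive a timeout only up to the last multiple of the checkpoint interval.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the retry loop: each attempt processes at most
// tok_per_s * timeout_s tokens before the client gives up, and progress
// survives only up to the last checkpoint (a multiple of ckpt_interval;
// 0 means no mid-prompt checkpoints).
// Returns the number of attempts needed, or -1 if retries never finish.
int attempts_needed(int64_t n_tokens, int64_t tok_per_s, int64_t timeout_s, int64_t ckpt_interval) {
    assert(n_tokens >= 0 && tok_per_s > 0 && timeout_s > 0 && ckpt_interval >= 0);
    const int64_t per_attempt = tok_per_s * timeout_s;
    int64_t saved = 0; // tokens covered by the newest checkpoint
    for (int attempt = 1; ; ++attempt) {
        if (n_tokens - saved <= per_attempt) {
            return attempt; // this attempt completes before the timeout
        }
        const int64_t reached    = saved + per_attempt;
        const int64_t next_saved = ckpt_interval > 0 ? (reached / ckpt_interval) * ckpt_interval : 0;
        if (next_saved <= saved) {
            return -1; // no progress is kept, so the loop never terminates
        }
        saved = next_saved;
    }
}
```

Assuming 1500 tokens/s and a 90 s timeout, a 150000-token prompt with checkpoints every 10000 tokens finishes on the second attempt, while the same prompt without mid-prompt checkpoints never finishes.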

Related Feedback

This aligns with concerns raised by @aagit and @ZacharyReis in #22929 about:

Stable-prefix chat workflows losing cache hits
LRU context strategies needing predictable checkpoint intervals
Speculative preloading being interrupted without recovery points

Questions

Is there a recommended workaround for retry-resilient long-context processing with the current build?
Would the team be open to a PR adding --checkpoint-max-step or similar?
Alternatively, is there a server-side retry/resume mechanism I'm missing?

Thanks for considering this! 🙏

Adds a backstep-and-restore path that prevents `update_slots` from wiping the entire slot on every multi-turn request to QWEN35MOE / QWEN3NEXT (hybrid Gated DeltaNet + attention). The post-LCP cut is rounded back to the nearest chat-template turn boundary, the existing context-checkpoint infrastructure is reused to restore state when the recurrent backend rejects partial seq_rm, and a checkpoint is forced on every boundary token so a usable rescue point always exists. Drops per-turn cost from ~30s full re-eval to ~1s O(delta) on a 36k-token Qwen 3.6 27B test (~45x).
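
A minimal sketch of the "round the cut back to the nearest turn boundary" step described in that commit message. The helper name is hypothetical and the boundary list is assumed to be sorted; the commit's actual implementation is not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: round a cut position down to the nearest turn
// boundary. `boundaries` must be sorted ascending; returns 0 when no
// boundary is at or before `cut`.
int32_t round_to_turn_boundary(const std::vector<int32_t> & boundaries, int32_t cut) {
    assert(std::is_sorted(boundaries.begin(), boundaries.end()));
    // first boundary strictly greater than cut; the one before it is the answer
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), cut);
    return it == boundaries.begin() ? 0 : *(it - 1);
}
```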

jacekpoplawski · 2026-05-28T09:49:49Z

@mscheurwater @ZacharyReis could you check #23808?

ggml-org#22929 creates a context checkpoint only before the last user message, so prompts with a stable prefix and content that changes between turns lose all checkpoint cache hits (the surviving checkpoints sit past the divergence point) and re-evaluate the full prompt every turn, notably on SWA models. Create a checkpoint before every user message instead, derived from the same message_spans. prompt_get_n_before_user() becomes prompt_get_user_boundaries(); the prefill batch breaks at each boundary and a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints.
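
For illustration, deriving one boundary per user message from a list of spans might look like the sketch below. The `span_sketch` struct and `user_boundaries_sketch` are hypothetical stand-ins (the real `message_spans` layout and `prompt_get_user_boundaries()` are not shown in this thread), and treating `--checkpoint-min-step` as a minimum distance between accepted boundaries is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical span record; the real message_spans layout may differ.
struct span_sketch {
    std::string role;      // "system", "user", "assistant", ...
    int32_t     tok_begin; // first prompt token of the message
};

// Token positions immediately before each user message, skipping boundaries
// that are closer than `min_step` tokens to the previously accepted one.
std::vector<int32_t> user_boundaries_sketch(const std::vector<span_sketch> & spans, int32_t min_step) {
    assert(min_step >= 0);
    std::vector<int32_t> out;
    for (const auto & s : spans) {
        if (s.role != "user" || s.tok_begin <= 0) {
            continue; // not a user message, or nothing precedes it
        }
        if (!out.empty() && s.tok_begin - out.back() < min_step) {
            continue; // too close to the previous boundary
        }
        out.push_back(s.tok_begin);
    }
    return out;
}
```

The caller would still have to cap the result at the number of checkpoints allowed by --ctx-checkpoints.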

ashirviskas · 2026-05-28T15:54:43Z

> Would the team be open to a PR adding --checkpoint-max-step or similar?

I agree, my prompt cache numbers after this merge (b9105 before, now on b9396, Qwen3.6 35B A3B) dropped quite a bit due to me not being able to set this number anymore. It is very relevant in some agentic workloads when you're sending many requests with same instructions but different data in one message and the whole thing fits in <1000 tokens.

Before: [screenshot: Cached / Total]

Now: [screenshot: Cached / Total]

wollastonzhu · 2026-05-29T14:06:37Z

I don't think I can convince everyone. I just want to add my own view of the value here:

The KV cache exists to reduce context processing time; it is a strategy that trades VRAM for time.
This matters most for very long contexts. On my 3090 with Qwen 27B at about 1200 tokens/s, 100k-token prompts are common when the model analyzes code, and they take around 90 seconds to process. The more aggressive the caching strategy, the more time it can save.
I have also observed with Qwen3.6 that the success rate of executing tasks and modifying code is higher with such very long contexts.
I hope someone changes this strategy one day so that checkpoints can be created in the middle of a very long context; then, if something goes wrong partway through, the next attempt does not have to start from zero. After all, agent and long-context scenarios will become far more common than short conversations with a few thousand tokens of context.

patrickzel · 2026-05-30T15:51:10Z

Regression: SWA model (Gemma 4) loses all checkpoint cache hits after this PR

Setup
* **Model:** Gemma 4 26B-A4B-it Q6_K_L (hybrid SWA + global attention, `n_swa = 1024`)
* **Use case:** Voice assistant with chat completions (`/v1/chat/completions`, `peg-gemma4` template)
* **Hardware:** RTX 5070 12GB, `--fit on`, single slot (`-np 1`)
* **Build:** b9037 (includes this PR)
* **Flags:** `-c 65536 -b 4096 -ub 1024 --fit on -fa on -np 1 --jinja -ctk bf16 -ctv bf16 --backend-sampling --metrics --checkpoint-min-step 256`

Prompt structure
system(~10.8K tokens, static persona — never changes)
→ user/assistant history (growing, but prior turns are stable prefix)
→ user: dynamic context block (~4K tokens, changes every turn — timestamps, health data, etc.)
→ user: utterance (~200 tokens)
The system prompt + conversation history is a stable prefix (~10.8K-16K tokens). Only the last two user messages (context + utterance) change between turns.

Before this PR

With --checkpoint-every-n-tokens 32, periodic checkpoints were created throughout the prompt. At least one existed within the stable system prompt prefix. On the next turn, that checkpoint could be restored and only the changed tail (~4-6K tokens) was re-evaluated. Cache worked — typical prompt eval ~2-4s.

After this PR

--checkpoint-min-step 256 with message-boundary placement creates checkpoints only near the end of the prompt (at user message boundaries). The stable system prompt prefix (0-10.8K) gets zero checkpoints.

On the next turn, the LCP correctly identifies ~10.8K matching tokens, but all checkpoints are past the divergence point. Every checkpoint is invalidated and the server falls back to BOS:
sim_best = 0.774, f_keep = 0.774
Checking checkpoint with [14960, 17007] against 12199...   ← past divergence
Checking checkpoint with [13936, 15983] against 12199...   ← past divergence
Checking checkpoint with [12581, 14628] against 12199...   ← past divergence
Checking checkpoint with [0, 0] against 12199...           ← falls back to BOS
restored context checkpoint (pos_min = 0, pos_max = 0, n_tokens = 1)
erased invalidated context checkpoint (pos_min = 12581, ..., n_swa = 1024, pos_next = 1)
erased invalidated context checkpoint (pos_min = 13936, ..., n_swa = 1024, pos_next = 1)
erased invalidated context checkpoint (pos_min = 14960, ..., n_swa = 1024, pos_next = 1)
Result: Cache: 1 on every single turn. Full 12s prompt re-evaluation of ~17K tokens instead of ~2-4s.
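
The pattern in the log above can be reproduced with a deliberately simplified selection rule. This is an assumption made for illustration, not the server's actual condition (which also takes `n_swa` into account): scan from newest to oldest and take the first checkpoint that lies entirely before `n_past`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint record with the two positions printed in the log.
struct ckpt_sketch {
    int32_t pos_min;
    int32_t pos_max;
};

// Simplified rule (assumption, ignores n_swa handling): scan from newest to
// oldest and take the first checkpoint that lies entirely before n_past.
// `ckpts` is ordered oldest to newest. Returns the index, or -1 if none fits.
int pick_checkpoint(const std::vector<ckpt_sketch> & ckpts, int32_t n_past) {
    assert(n_past >= 0);
    for (int i = (int) ckpts.size() - 1; i >= 0; --i) {
        if (ckpts[i].pos_max < n_past) {
            return i;
        }
    }
    return -1;
}
```

With the four checkpoints from the log and `n_past = 12199`, only `[0, 0]` qualifies. Had there been one more checkpoint inside the stable prefix, for example a made-up `[8000, 10000]`, this rule would select it instead of falling back to BOS.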

Impact

Every conversation turn pays the full prefill cost. For a 17K token prompt at ~1900 tok/s, that's ~9-12 seconds of prompt processing per turn — making the assistant feel sluggish despite having a stable 10.8K token prefix that could be cached.

Root cause

The PR's logic creates checkpoints at user message boundaries near the end of the prompt and skips periodic mid-prompt checkpoints. For chat with a stable prefix and dynamic tail, this means:
1. The stable prefix (system prompt, 0-10.8K) gets no checkpoints
2. All checkpoints are in the dynamic region (12K+) that changes between turns
3. Every checkpoint is invalidated on every turn
4. No restoration is possible despite 50-85% LCP similarity

Suggested fix

Create a checkpoint at the system/history boundary (end of the system message or start of the first user message) in addition to the message-boundary checkpoints near the end. This is the natural "always-stable" checkpoint for chat — the system prompt doesn't change between turns, and a checkpoint there would allow restoring 10.8K tokens on every turn regardless of what changes in the conversation.

Alternatively, provide a flag to re-enable periodic checkpoints alongside message-boundary ones (similar to what @aagit is requesting above).

I'm also experiencing this issue. I use RAG and include the information in the message with a depth of 4. Because of the change made here, I lose the entire context.
One solution would be to set the number of user checkpoints so that you can simply save the last 5 or so. As it stands now, Llama Server only creates two checkpoints, and those are always discarded.

wollastonzhu · 2026-05-31T03:56:16Z

The potential negative effects of this change are still coming to light.
I still don't understand what the general benefit of the change in #22929 is.
AI use cases are diverse, and an inference framework needs broad compatibility. I would also like the KV cache to store only static content more precisely, but just as public transport (subways, trains, buses) cannot know in advance which passengers have special needs, providing simple, general-purpose benches first is the best approach.
Could we revert to the original, more mechanical KV cache behavior as soon as possible? Specialized optimizations should then be offered as separate options for specific scenarios.

Die4Ever · 2026-05-31T05:11:54Z

Yeah sometimes you could have a really long system prompt with an ending that dynamically changes. Checkpoints by message boundaries might not always work well.

patrickzel · 2026-05-31T06:13:27Z

> Yeah sometimes you could have a really long system prompt with an ending that dynamically changes. Checkpoints by message boundaries might not always work well.

If the context needs to be recalculated, a checkpoint could be generated for each user message.
In my case, the context is recalculated every time, and only two checkpoints are created, which are then invalidated again.
This new behavior is a real deal-breaker, especially when loading existing chats.

wollastonzhu · 2026-06-01T01:15:31Z

I am currently using the following strategy to adapt to the new KV cache logic:

The front-end application is OpenClaw.
I reduced the context window in OpenClaw's model configuration to 2/3 of its previous size, to reduce the amount of prompt processing.
I set --parallel 1 in llama.cpp so that parallel conversations do not slow down prompt processing to the point where the client times out before the conversation boundary is reached.
I added --cache-reuse 1024. I will observe the cache behavior this week.

vikaskumarsingh123 · 2026-06-02T05:20:18Z

This PR has had a huge negative impact on how I can use big models like Qwen 3.5 122B to refactor codebases.
Earlier I could get it to process large prompts with Zoo Code, but now I fall short by just a few thousand tokens.

I could even use the mammoth Qwen 3.5 397B at 4 t/s TG for overnight tasks, but now it can't even process my initial prompt.

Many times it is not possible to change the timeout for the client (Zoo Code in this example) as there could be network timeouts on the way to the llama server.

I strongly second bringing back the --checkpoint-every-n-tokens CLI flag.

git merge's 3-way resolver did not flag two semantic duplicates in tools/server/server-context.cpp because the merge base did not contain either symbol. The duplicate bodies are byte-identical, so removing the second copy of each pair is semantically equivalent. Removed: - `const bool near_prompt_end` declaration at line 3653 (upstream side, e2ef8fe, "server: fix checkpoints creation" PR ggml-org#22929). - `static uint32_t server_n_outputs_max(...)` body at lines 219-232 (upstream side, de6f727, "llama: limit max outputs of llama_context" PR ggml-org#23861; one line modified by 5dcb711, "speculative: fix n_outputs_max and remove draft-simple auto-enable" PR ggml-org#23988). Kept: - The cache-side copies (72cfbcd), which match the cache-optimization chain the Stage 11 work is built on.

jacekpoplawski requested review from a team and pwilkin as code owners May 11, 2026 01:31

github-actions Bot added testing Everything test related examples server labels May 11, 2026

jacekpoplawski mentioned this pull request May 11, 2026

server: preserve context checkpoint coverage #22826

Open

jacekpoplawski marked this pull request as draft May 11, 2026 01:37

jacekpoplawski force-pushed the fix-checkpoints-creation branch from d878621 to ea9369c Compare May 11, 2026 01:40

jacekpoplawski changed the title ~~Fix checkpoints creation~~ server: fix checkpoints creation May 11, 2026

jacekpoplawski marked this pull request as ready for review May 11, 2026 02:03

aldehir reviewed May 11, 2026

jacekpoplawski mentioned this pull request May 13, 2026

Eval bug: forcing full prompt re-processing in Qwen3-Coder-Next #19394

Closed

ggerganov reviewed May 13, 2026

kripper mentioned this pull request May 13, 2026

Feature Request: Allow independent control of slots (-ns) and parallelism (-np) #22921

Open

jacekpoplawski mentioned this pull request May 14, 2026

Prompt cache is not reused for repeated identical chat/completions request with Qwen3.6-35B-A3B #23030

Open

bjahoor mentioned this pull request May 15, 2026

server : fix prompt-cache reuse for hybrid/recurrent models #23121

Closed

1 task

ggerganov self-assigned this May 16, 2026

iridium87 mentioned this pull request May 26, 2026

Eval bug: Checkpoints and MMProj on Gemma 4 consume abnormal amounts of RAM, leading to llama-server going OOM #21690

Open

jesseposner mentioned this pull request May 27, 2026

[WIP] Gemma 4 MTP #23398

Draft

jacekpoplawski mentioned this pull request May 28, 2026

server: system prompt checkpoint #23808

Closed

mfielding92 mentioned this pull request May 28, 2026

server : checkpoint before every user turn boundary #23814

Closed

fantasyz mentioned this pull request May 31, 2026

[Bug] qwen3next (Qwen3.6): context checkpoints never restored — full prompt re-processing on every turn ikawrakow/ik_llama.cpp#1762

Open

ggerganov mentioned this pull request Jun 3, 2026

server: avoid unnecessary checkpoint invalidation for recurrent / hybrid models #24035

Draft

patrickzel mentioned this pull request Jun 4, 2026

Misc. bug: Context checkpoints always invalidated on hybrid/recurrent models #24055

Open
