server : speculative checkpointing #19493

Open

srogmann wants to merge 5 commits into ggml-org:master from srogmann:feature/speculative-checkpointing

Conversation

@srogmann
Collaborator

This PR is a follow-up to #19270 (see #19267) to support speculative decoding with recurrent models by using checkpoints. Using checkpoints is not as fast as llama_memory_seq_rm, because in the case of a partially accepted draft we have to go back to the checkpoint and execute a shorter batch.

However, in use cases such as the quicksort example in #19164, we observe a large speedup (in this very repetitive case!), hence this PR.

This PR contains a small fix of the ngram-map-k implementation.

Questions / open tasks:

  • ngram-map-k uses the accept feedback to shorten its drafts. I haven't looked into how to execute a batch without sampling (this would be fine when repeating a shorter draft without reusing the speculative implementation).
  • To get better statistics we should distinguish between accepted tokens and tokens that could have been accepted.
  • The creation of a checkpoint could be extracted into a common function (search for "make room").
  • Is the use of the llama_state_seq functions in this PR correct?

Server log using Qwen3-Coder-Next with the arguments --spec-type ngram-map-k --draft-max 48 --spec-ckpt-num-tries 2 --ctx-checkpoints 16 and the quicksort prompts from #19164:

print_info: general.name          = Qwen3-Coder-Next
[...]
srv    load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv    load_model: speculative decoding not supported by this context without checkpoints
[...]
prompt eval time =      59.95 ms /    20 tokens (    3.00 ms per token,   333.58 tokens per second)
       eval time =    1723.78 ms /   166 tokens (   10.38 ms per token,    96.30 tokens per second)
      total time =    1783.74 ms /   186 tokens
statistics ngram_map_k: #calls(b,g,a) = 1 165 0, #gen drafts = 0, #acc drafts = 0, #gen tokens = 0, #acc tokens = 0, dur(b,g,a) = 0.001, 0.029, 0.000 ms
slot      release: id  3 | task 0 | stop processing: n_tokens = 185, truncated = 0
[...]
prompt eval time =      47.36 ms /    14 tokens (    3.38 ms per token,   295.62 tokens per second)
       eval time =    1563.85 ms /   252 tokens (    6.21 ms per token,   161.14 tokens per second)
      total time =    1611.21 ms /   266 tokens
draft acceptance rate = 0.72414 (  126 accepted /   174 generated)
statistics ngram_map_k: #calls(b,g,a) = 2 291 3, #gen drafts = 4, #acc drafts = 3, #gen tokens = 192, #acc tokens = 126, dur(b,g,a) = 0.002, 0.076, 0.017 ms
slot      release: id  3 | task 167 | stop processing: n_tokens = 450, truncated = 0
[...]
prompt eval time =      48.04 ms /    15 tokens (    3.20 ms per token,   312.25 tokens per second)
       eval time =    2048.35 ms /   288 tokens (    7.11 ms per token,   140.60 tokens per second)
      total time =    2096.39 ms /   303 tokens
draft acceptance rate = 0.39186 (  154 accepted /   393 generated)
statistics ngram_map_k: #calls(b,g,a) = 3 428 9, #gen drafts = 15, #acc drafts = 9, #gen tokens = 677, #acc tokens = 280, dur(b,g,a) = 0.002, 0.150, 0.050 ms
slot      release: id  3 | task 295 | stop processing: n_tokens = 752, truncated = 0
[...]
prompt eval time =      45.51 ms /    15 tokens (    3.03 ms per token,   329.57 tokens per second)
       eval time =    1145.59 ms /   296 tokens (    3.87 ms per token,   258.38 tokens per second)
      total time =    1191.11 ms /   311 tokens
draft acceptance rate = 0.71171 (  237 accepted /   333 generated)
statistics ngram_map_k: #calls(b,g,a) = 4 488 16, #gen drafts = 24, #acc drafts = 16, #gen tokens = 1066, #acc tokens = 517, dur(b,g,a) = 0.003, 0.198, 0.082 ms
slot      release: id  3 | task 435 | stop processing: n_tokens = 1062, truncated = 0
[...]
slot print_timing: id  3 | task 497 | 
prompt eval time =      48.03 ms /    16 tokens (    3.00 ms per token,   333.15 tokens per second)
       eval time =    1063.58 ms /   284 tokens (    3.74 ms per token,   267.02 tokens per second)
      total time =    1111.60 ms /   300 tokens
draft acceptance rate = 0.62304 (  238 accepted /   382 generated)
statistics ngram_map_k: #calls(b,g,a) = 5 536 22, #gen drafts = 33, #acc drafts = 22, #gen tokens = 1498, #acc tokens = 755, dur(b,g,a) = 0.004, 0.251, 0.112 ms
slot      release: id  3 | task 497 | stop processing: n_tokens = 1361, truncated = 0

AI usage: Qwen3-Coder for auto-complete (common.h :-) ), some questions to MiniMax-M2.1.

@ggerganov (Member) left a comment

I think this is good as a prototype, but we must find a way to encapsulate this logic in common/speculative. We should keep the server code clean of extra speculative-related logic so that it is easier to maintain and to introduce new speculative approaches later on.

Qwen3-Coder for auto-complete

I also use this model for auto completion. Which IDE/client do you use?

@srogmann
Collaborator Author

common/speculative.cpp should encapsulate the spec_ckpt_ variables and the logic.

Which IDE/client do you use?

For llama.cpp I use Neovim with the llama.vim plugin.

@srogmann srogmann force-pushed the feature/speculative-checkpointing branch from c591189 to 0fa66c2 Compare February 16, 2026 21:24