server : speculative checkpointing #19493

Open

srogmann wants to merge 5 commits into ggml-org:master from srogmann:feature/speculative-checkpointing

Conversation

@srogmann
Collaborator

This PR is a follow-up to #19270 (see #19267) to support speculative decoding with recurrent models by using checkpoints. Using checkpoints is not as fast as llama_memory_seq_rm, because in the case of a partially accepted draft we have to go back to the checkpoint and execute a shorter batch.

However, in use cases such as the quicksort example in #19164, we observe a large speedup (in this very repetitive case!), hence this PR.

This PR contains a small fix of the ngram-map-k implementation.

Questions / open tasks:

  • ngram-map-k uses the accept feedback to shorten its drafts. I haven't looked into how to execute a batch without sampling (this would be fine when repeating a shorter draft without reusing the speculative implementation).
  • To get better statistics we should distinguish between accepted tokens and tokens that could have been accepted.
  • The creation of a checkpoint could be extracted into a common function (search for "make room").
  • Is the use of the llama_state_seq functions in this PR correct?

Server log using Qwen3-Coder-Next with the arguments --spec-type ngram-map-k --draft-max 48 --spec-ckpt-num-tries 2 --ctx-checkpoints 16 and the quicksort prompts from #19164:

print_info: general.name          = Qwen3-Coder-Next
[...]
srv    load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv    load_model: speculative decoding not supported by this context without checkpoints
[...]
prompt eval time =      59.95 ms /    20 tokens (    3.00 ms per token,   333.58 tokens per second)
       eval time =    1723.78 ms /   166 tokens (   10.38 ms per token,    96.30 tokens per second)
      total time =    1783.74 ms /   186 tokens
statistics ngram_map_k: #calls(b,g,a) = 1 165 0, #gen drafts = 0, #acc drafts = 0, #gen tokens = 0, #acc tokens = 0, dur(b,g,a) = 0.001, 0.029, 0.000 ms
slot      release: id  3 | task 0 | stop processing: n_tokens = 185, truncated = 0
[...]
prompt eval time =      47.36 ms /    14 tokens (    3.38 ms per token,   295.62 tokens per second)
       eval time =    1563.85 ms /   252 tokens (    6.21 ms per token,   161.14 tokens per second)
      total time =    1611.21 ms /   266 tokens
draft acceptance rate = 0.72414 (  126 accepted /   174 generated)
statistics ngram_map_k: #calls(b,g,a) = 2 291 3, #gen drafts = 4, #acc drafts = 3, #gen tokens = 192, #acc tokens = 126, dur(b,g,a) = 0.002, 0.076, 0.017 ms
slot      release: id  3 | task 167 | stop processing: n_tokens = 450, truncated = 0
[...]
prompt eval time =      48.04 ms /    15 tokens (    3.20 ms per token,   312.25 tokens per second)
       eval time =    2048.35 ms /   288 tokens (    7.11 ms per token,   140.60 tokens per second)
      total time =    2096.39 ms /   303 tokens
draft acceptance rate = 0.39186 (  154 accepted /   393 generated)
statistics ngram_map_k: #calls(b,g,a) = 3 428 9, #gen drafts = 15, #acc drafts = 9, #gen tokens = 677, #acc tokens = 280, dur(b,g,a) = 0.002, 0.150, 0.050 ms
slot      release: id  3 | task 295 | stop processing: n_tokens = 752, truncated = 0
[...]
prompt eval time =      45.51 ms /    15 tokens (    3.03 ms per token,   329.57 tokens per second)
       eval time =    1145.59 ms /   296 tokens (    3.87 ms per token,   258.38 tokens per second)
      total time =    1191.11 ms /   311 tokens
draft acceptance rate = 0.71171 (  237 accepted /   333 generated)
statistics ngram_map_k: #calls(b,g,a) = 4 488 16, #gen drafts = 24, #acc drafts = 16, #gen tokens = 1066, #acc tokens = 517, dur(b,g,a) = 0.003, 0.198, 0.082 ms
slot      release: id  3 | task 435 | stop processing: n_tokens = 1062, truncated = 0
[...]
slot print_timing: id  3 | task 497 | 
prompt eval time =      48.03 ms /    16 tokens (    3.00 ms per token,   333.15 tokens per second)
       eval time =    1063.58 ms /   284 tokens (    3.74 ms per token,   267.02 tokens per second)
      total time =    1111.60 ms /   300 tokens
draft acceptance rate = 0.62304 (  238 accepted /   382 generated)
statistics ngram_map_k: #calls(b,g,a) = 5 536 22, #gen drafts = 33, #acc drafts = 22, #gen tokens = 1498, #acc tokens = 755, dur(b,g,a) = 0.004, 0.251, 0.112 ms
slot      release: id  3 | task 497 | stop processing: n_tokens = 1361, truncated = 0

AI usage: Qwen3-Coder for auto-complete (common.h :-) ), some questions to MiniMax-M2.1.

@ggerganov (Member) left a comment

I think this is good as a prototype, but we must find a way to encapsulate this logic in common/speculative. We should keep the server code clean of extra speculative-related logic so that it is easier to maintain and to introduce new speculative approaches later on.

Qwen3-Coder for auto-complete

I also use this model for auto completion. Which IDE/client do you use?

@srogmann
Collaborator Author

common/speculative.cpp should encapsulate the spec_ckpt_ variables and the logic.

Which IDE/client do you use?

For llama.cpp I use Neovim with the llama.vim plugin.

@srogmann srogmann force-pushed the feature/speculative-checkpointing branch from c591189 to 0fa66c2 Compare February 16, 2026 21:24