UPSTREAM PR #19493: server : speculative checkpointing #1163

Open
loci-dev wants to merge 4 commits into main from loci/pr-19493-feature-speculative-checkpointing

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#19493

This PR is a follow-up to #19270 (see #19267) to support the use of speculative decoding with recurrent models using checkpoints. Using checkpoints is not as fast as llama_memory_seq_rm, because in the case of a partially accepted draft we have to go back to the checkpoint and execute a shorter batch.

However, in use cases such as the quicksort example in #19164, we observe a large speedup (in this very repetitive case!), hence this PR.
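To make the rollback cost concrete, here is a toy sketch of the checkpoint fallback. The "target model" below is a hypothetical stand-in (it just continues a fixed pattern); the real PR snapshots recurrent state via llama_state_seq functions rather than copying a token list.

```python
def target_next(state):
    # Hypothetical stand-in target model: deterministically continues 1, 2, 3.
    return [1, 2, 3][len(state) % 3]

def eval_draft_with_checkpoint(state, draft):
    """Verify a draft against the target. On a partially accepted draft,
    roll back to the checkpoint and re-execute only the accepted prefix
    as a shorter batch (the extra cost noted above)."""
    checkpoint = list(state)               # snapshot before evaluating the draft
    accepted = []
    for tok in draft:
        if tok != target_next(state):
            # partial acceptance: restore the checkpoint, replay a shorter batch
            state = checkpoint + accepted
            return state, accepted
        state = state + [tok]
        accepted.append(tok)
    return state, accepted                 # fully accepted draft: no rollback

state, accepted = eval_draft_with_checkpoint([], [1, 2, 9, 9])
print(accepted)  # [1, 2]
```

With llama_memory_seq_rm one could instead drop only the rejected suffix in place; the checkpoint path pays for the replay of the accepted prefix, which is why it only wins when drafts are long and acceptance is high.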

This PR also contains a small fix to the ngram-map-k implementation.

Questions / open tasks:

  • ngram-map-k uses the accept feedback to shorten its drafts. I haven't looked into how to execute a batch without sampling (this would be fine when repeating a shorter draft without reusing the speculative implementation).
  • To get better statistics, we should distinguish between accepted and could-be-accepted tokens.
  • The creation of a checkpoint could be extracted into a common function (search for "make room").
  • Is the use of the llama_state_seq functions in this PR correct?

server log using Qwen3-Coder-Next, arguments --spec-type ngram-map-k --draft-max 48 --spec-ckpt-num-tries 2 --ctx-checkpoints 16 with quicksort prompts from #19164 :

print_info: general.name          = Qwen3-Coder-Next
[...]
srv    load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv    load_model: speculative decoding not supported by this context without checkpoints
[...]
prompt eval time =      59.95 ms /    20 tokens (    3.00 ms per token,   333.58 tokens per second)
       eval time =    1723.78 ms /   166 tokens (   10.38 ms per token,    96.30 tokens per second)
      total time =    1783.74 ms /   186 tokens
statistics ngram_map_k: #calls(b,g,a) = 1 165 0, #gen drafts = 0, #acc drafts = 0, #gen tokens = 0, #acc tokens = 0, dur(b,g,a) = 0.001, 0.029, 0.000 ms
slot      release: id  3 | task 0 | stop processing: n_tokens = 185, truncated = 0
[...]
prompt eval time =      47.36 ms /    14 tokens (    3.38 ms per token,   295.62 tokens per second)
       eval time =    1563.85 ms /   252 tokens (    6.21 ms per token,   161.14 tokens per second)
      total time =    1611.21 ms /   266 tokens
draft acceptance rate = 0.72414 (  126 accepted /   174 generated)
statistics ngram_map_k: #calls(b,g,a) = 2 291 3, #gen drafts = 4, #acc drafts = 3, #gen tokens = 192, #acc tokens = 126, dur(b,g,a) = 0.002, 0.076, 0.017 ms
slot      release: id  3 | task 167 | stop processing: n_tokens = 450, truncated = 0
[...]
prompt eval time =      48.04 ms /    15 tokens (    3.20 ms per token,   312.25 tokens per second)
       eval time =    2048.35 ms /   288 tokens (    7.11 ms per token,   140.60 tokens per second)
      total time =    2096.39 ms /   303 tokens
draft acceptance rate = 0.39186 (  154 accepted /   393 generated)
statistics ngram_map_k: #calls(b,g,a) = 3 428 9, #gen drafts = 15, #acc drafts = 9, #gen tokens = 677, #acc tokens = 280, dur(b,g,a) = 0.002, 0.150, 0.050 ms
slot      release: id  3 | task 295 | stop processing: n_tokens = 752, truncated = 0
[...]
prompt eval time =      45.51 ms /    15 tokens (    3.03 ms per token,   329.57 tokens per second)
       eval time =    1145.59 ms /   296 tokens (    3.87 ms per token,   258.38 tokens per second)
      total time =    1191.11 ms /   311 tokens
draft acceptance rate = 0.71171 (  237 accepted /   333 generated)
statistics ngram_map_k: #calls(b,g,a) = 4 488 16, #gen drafts = 24, #acc drafts = 16, #gen tokens = 1066, #acc tokens = 517, dur(b,g,a) = 0.003, 0.198, 0.082 ms
slot      release: id  3 | task 435 | stop processing: n_tokens = 1062, truncated = 0
[...]
slot print_timing: id  3 | task 497 | 
prompt eval time =      48.03 ms /    16 tokens (    3.00 ms per token,   333.15 tokens per second)
       eval time =    1063.58 ms /   284 tokens (    3.74 ms per token,   267.02 tokens per second)
      total time =    1111.60 ms /   300 tokens
draft acceptance rate = 0.62304 (  238 accepted /   382 generated)
statistics ngram_map_k: #calls(b,g,a) = 5 536 22, #gen drafts = 33, #acc drafts = 22, #gen tokens = 1498, #acc tokens = 755, dur(b,g,a) = 0.004, 0.251, 0.112 ms
slot      release: id  3 | task 497 | stop processing: n_tokens = 1361, truncated = 0
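The reported per-request acceptance rates can be cross-checked against the raw accepted/generated counts in the log, e.g.:

```python
# Cross-check the reported draft acceptance rates from the log above.
accepted, generated = 126, 174          # second request in the log
print(round(accepted / generated, 5))   # 0.72414, as reported

accepted, generated = 238, 382          # last request in the log
print(round(accepted / generated, 5))   # 0.62304, as reported
```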

AI usage: Qwen3-Coder for auto-complete (common.h :-) ), some questions to MiniMax-M2.1.

@loci-review

loci-review bot commented Feb 11, 2026

Overview

Analysis of 115,686 functions across 61 commits identified 144 modified functions (0.12%), 6 new functions, and 0 removed functions. Power consumption changes are negligible across all binaries, with the largest variation at -0.116% for llama-tts. All significant performance changes occur in non-critical initialization code paths, with no impact on inference operations.

Binary Power Consumption Changes:

  • build.bin.llama-tts: -0.116% (-420.88 nJ)
  • build.bin.llama-cvector-generator: -0.009% (-30.89 nJ)
  • build.bin.libllama.so: 0.000% (+1.04 nJ)
  • build.bin.libmtmd.so: -0.000% (-0.09 nJ)
  • All other binaries (llama-tokenize, llama-quantize, llama-qwen2vl-cli, libggml-cpu.so, libggml-base.so, libggml.so, llama-bench, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli): 0.000%

Function Analysis

CLI Argument Parsing Lambdas (arg.cpp operator(), llama-tts and llama-cvector-generator): Response time increased from 14.5ns to ~96ns (+557-560%), throughput time from 14.5ns to ~73ns (+402%). No source code changes detected. Regression caused by build system refactoring (commits 11fb327, 423bee4) that consolidated sanitizer flags, preventing lambda inlining in debug/CI builds. Affects only startup initialization, not inference.

STL Iterator Functions (multiple std::vector::begin/end, std::_Rb_tree methods): Mixed results with regressions of +214-307% throughput time (~180ns absolute increase) and improvements of -69 to -75% throughput time (~183ns absolute decrease). No source code changes; variations stem from compiler optimization differences between builds. Called during initialization (model loading, CLI parsing), not in inference loops.

Jinja Template Capability Detection (std::__invoke_r wrapper, cvector-generator): Throughput time increased +205% (+130ns), but response time increased only +0.75% (+1.7μs). Represents intentional correctness improvement transitioning from binary "requires_typed_content" flag to dual "supports_string_content" and "supports_typed_content" flags, enhancing template compatibility detection. Executes once during template loading.

Other analyzed functions (HTTP socket utilities, regex compilation) showed regressions in rarely-used initialization paths with negligible absolute impact (<200ns).

Additional Findings

Performance-critical inference paths remain unchanged: matrix operations (GEMM), attention mechanisms, quantization kernels, and KV cache management show zero modifications. GPU backend improvements across CUDA, Metal, and Vulkan (17 commits) include Flash Attention optimizations, bug fixes for non-contiguous tensors, and adaptive CPU/GPU work distribution, though these enhancements don't appear in function-level analysis as they affect backend-specific code. The 0.000% power consumption change in libllama.so confirms core inference operations are unaffected. Cross-function impact analysis reveals no propagation of regressions through call chains, with all changes isolated to their respective functions.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 on February 19, 2026
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 45aacad to 6e8718a on February 24, 2026