UPSTREAM PR #19493: server : speculative checkpointing (#1163)
Overview

Analysis of 115,686 functions across 61 commits identified 144 modified functions (0.12%), 6 new functions, and 0 removed functions. Power consumption changes are negligible across all binaries, with the largest variation at -0.116% for llama-tts. All significant performance changes occur in non-critical initialization code paths, with no impact on inference operations.

Binary Power Consumption Changes:
Function Analysis

- CLI Argument Parsing Lambdas (arg.cpp operator(), llama-tts and llama-cvector-generator): response time increased from 14.5ns to ~96ns (+557-560%), throughput time from 14.5ns to ~73ns (+402%). No source code changes detected; the regression is caused by build system refactoring (commits 11fb327, 423bee4) that consolidated sanitizer flags, preventing lambda inlining in debug/CI builds. Affects only startup initialization, not inference.
- STL Iterator Functions (multiple std::vector::begin/end, std::_Rb_tree methods): mixed results, with regressions of +214-307% throughput time (~180ns absolute increase) and improvements of -69 to -75% throughput time (~183ns absolute decrease). No source code changes; the variations stem from compiler optimization differences between builds. Called during initialization (model loading, CLI parsing), not in inference loops.
- Jinja Template Capability Detection (std::__invoke_r wrapper, cvector-generator): throughput time increased +205% (+130ns), but response time increased only +0.75% (+1.7μs). This is an intentional correctness improvement: the binary "requires_typed_content" flag was replaced by dual "supports_string_content" and "supports_typed_content" flags, enhancing template compatibility detection. Executes once during template loading.
- Other analyzed functions (HTTP socket utilities, regex compilation) showed regressions in rarely-used initialization paths with negligible absolute impact (<200ns).

Additional Findings

Performance-critical inference paths remain unchanged: matrix operations (GEMM), attention mechanisms, quantization kernels, and KV cache management show zero modifications. GPU backend improvements across CUDA, Metal, and Vulkan (17 commits) include Flash Attention optimizations, bug fixes for non-contiguous tensors, and adaptive CPU/GPU work distribution, though these enhancements don't appear in function-level analysis as they affect backend-specific code.
The 0.000% power consumption change in libllama.so confirms core inference operations are unaffected. Cross-function impact analysis reveals no propagation of regressions through call chains, with all changes isolated to their respective functions. 🔎 Full breakdown: Loci Inspector.
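The dual-flag template capability detection mentioned in the analysis above can be pictured as probing the chat template twice instead of keeping a single boolean. The sketch below is an assumption about the shape of that logic, not the actual llama.cpp code; `detect_caps` and `probe` are hypothetical names, where `probe` stands in for one trial render of the Jinja template with either string-typed or array-typed message content.

```cpp
#include <cassert>
#include <functional>

// Instead of a single "requires_typed_content" flag, keep both results.
struct chat_template_caps {
    bool supports_string_content = false;
    bool supports_typed_content  = false;
};

// probe(typed) is a stand-in for rendering the template once with plain
// string content (typed == false) and once with a typed content array
// (typed == true); it returns whether the render succeeded.
chat_template_caps detect_caps(const std::function<bool(bool)> & probe) {
    chat_template_caps caps;
    caps.supports_string_content = probe(/*typed=*/false);
    caps.supports_typed_content  = probe(/*typed=*/true);
    return caps;
}
```

Keeping both flags lets the server pick whichever content form a template accepts, rather than forcing one form whenever the other is "required".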
Note
Source pull request: ggml-org/llama.cpp#19493
This PR is a follow-up to #19270 (see #19267) to support the use of speculative decoding with recurrent models using checkpoints. The use of checkpoints is not as fast as llama_memory_seq_rm, because in case of a partially accepted draft, we have to go back to the checkpoint and execute a shorter batch. However, in use cases such as the quicksort example in #19164, we observe a large speedup (in this very repetitive case!), hence this PR.
This PR contains a small fix of the ngram-map-k implementation.

Questions / open tasks:

- ngram-map-k uses the accept feedback to shorten its drafts. I haven't looked into how to execute a batch without sampling (this would be fine when repeating a shorter draft without reusing the speculative implementation).
- …make room).
- Is the use of the llama_state_seq functions in this PR correct?

Server log using Qwen3-Coder-Next, arguments
--spec-type ngram-map-k --draft-max 48 --spec-ckpt-num-tries 2 --ctx-checkpoints 16, with quicksort prompts from #19164.

AI usage: Qwen3-Coder for auto-complete (common.h :-) ), some questions to MiniMax-M2.1.
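The accept-feedback idea from the open tasks above can be illustrated with a small policy sketch. This is a hypothetical illustration, not the ngram-map-k implementation: `draft_limiter`, its growth/shrink policy, and the field names are assumptions. The idea is simply that partial acceptance caps the next draft near the divergence point, while full acceptance grows the draft length back toward the maximum (e.g. the value of --draft-max).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct draft_limiter {
    size_t draft_max;  // upper bound, e.g. from --draft-max
    size_t next_len;   // proposed length of the next draft

    explicit draft_limiter(size_t max_len)
        : draft_max(max_len), next_len(max_len) {}

    // after each verification step, feed back how many of the
    // n_drafted tokens were accepted
    void feedback(size_t n_drafted, size_t n_accepted) {
        if (n_accepted == n_drafted) {
            // full acceptance: grow back toward the maximum
            next_len = std::min(draft_max, next_len * 2);
        } else {
            // partial acceptance: don't draft much past the point
            // where the last draft diverged
            next_len = std::max<size_t>(1, n_accepted + 1);
        }
    }
};
```

A policy like this avoids repeatedly paying for long drafts (and their checkpoint rollbacks) when the model keeps diverging early.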