
UPSTREAM PR #19261: spec : fix the check-rate logic of ngram-simple #1133

Open
loci-dev wants to merge 2 commits into main from loci/pr-19261-gg-spec-simple-freq-check

Conversation

loci-dev commented Feb 2, 2026

Note

Source pull request: ggml-org/llama.cpp#19261

fix #19231

For the spec-simple method, we don't need to track the last checked length to rate-limit the draft generations; a simple incremental counter suffices. This makes the speculator work with "Regenerate" of the last message and with branching the conversation from earlier messages.
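A minimal sketch of the two throttling strategies, using the field names quoted in the review below (idx_last_check, check_id, check_rate); this is illustrative, not the actual llama.cpp code:

```cpp
#include <cstddef>

// Old, position-based throttle: remembers the context length at the last
// check. If the context is rewound ("Regenerate", branching), cur_len can
// drop below idx_last_check and the comparison behaves unpredictably.
bool should_check_positional(size_t cur_len, size_t & idx_last_check, size_t check_rate) {
    if (idx_last_check + check_rate > cur_len) {
        return false;             // too soon since the last check
    }
    idx_last_check = cur_len;     // record the position we checked at
    return true;
}

// New, counter-based throttle: counts draft calls instead of positions,
// so rewinding or branching the context cannot confuse it.
bool should_check_counter(size_t & check_id, size_t check_rate) {
    if (check_id++ < check_rate) {
        return false;             // not enough calls since the last check
    }
    check_id = 0;                 // reset and perform the check
    return true;
}
```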

loci-review bot commented Feb 2, 2026

Overview

Analysis of commit a104cff ("spec : fix the check-rate logic of ngram-simple") across llama.cpp reveals minimal performance impact. Of 115,425 total functions, only 2 were modified (0.0017%), with none added or removed. The change fixes a correctness issue in n-gram speculative decoding at the cost of a 29ns throughput-time increase (+5.03%), with a response-time impact of only 0.11%.

Power Consumption by Binary:

  • build.bin.llama-tts: 360,886.57 nJ (+0.001%)
  • build.bin.llama-cvector-generator: 355,773.62 nJ (+0.001%)
  • build.bin.libllama.so: 249,105.58 nJ (-0.0%)
  • build.bin.libmtmd.so: 179,022.45 nJ (0.0%)
  • build.bin.llama-tokenize: 38,524.70 nJ (0.0%)
  • build.bin.llama-quantize: 43,714.74 nJ (0.0%)
  • build.bin.llama-qwen2vl-cli: 277.24 nJ (0.0%)
  • build.bin.llama-gemma3-cli: 277.24 nJ (0.0%)
  • build.bin.llama-gguf-split: 40,060.05 nJ (0.0%)
  • build.bin.llama-llava-cli: 277.24 nJ (0.0%)
  • build.bin.llama-minicpmv-cli: 277.24 nJ (0.0%)
  • build.bin.libggml.so: 5,124.39 nJ (0.0%)
  • build.bin.libggml-cpu.so: 157,685.86 nJ (0.0%)
  • build.bin.libggml-base.so: 73,208.69 nJ (0.0%)
  • build.bin.llama-bench: 60,119.52 nJ (0.0%)

Function Analysis

common_ngram_simple_draft (build.bin.llama-tts, build.bin.llama-cvector-generator):

  • Throughput: 575ns → 604ns (+29ns, +5.03%)
  • Response: 26,398ns → 26,427ns (+29ns, +0.11%)

The change refactors the check-rate logic from position-based (idx_last_check + check_rate > cur_len) to counter-based (check_id++ >= check_rate), fixing unpredictable pattern-matching behavior in speculative decoding. The 29ns increase comes from the extra counter increment and comparison. The refactor improves correctness and maintainability while keeping the impact on overall inference negligible (<0.03% of a typical 10-100ms token-generation time). Identical deltas in throughput and response time confirm the impact is isolated to the function body, with no propagation to called functions.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev force-pushed the main branch 2 times, most recently from d9cffb7 to 1e94f5e on February 2, 2026 09:24
loci-dev force-pushed the main branch 2 times, most recently from 01000b6 to 4c1b7f6 on February 2, 2026 11:20
loci-review bot commented Feb 2, 2026

Overview

Analysis of 115,469 functions across 15 binaries reveals minimal performance impact from speculative decoding refactoring. Modified: 80 functions (0.07%), new: 44, removed: 55, unchanged: 115,290.

Power consumption changes:

  • build.bin.llama-tts: -0.082% (-295 nJ)
  • build.bin.llama-cvector-generator: -0.102% (-364 nJ)
  • build.bin.libmtmd.so, build.bin.libllama.so: <0.001%
  • build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli: 0%

Function Analysis

common_speculative_state_ngram_simple::draft() (both binaries): an intentional optimization adding check-rate throttling. Throughput time increased by 79% (+61ns) due to the added counter logic, but response time improved by about 1% (-284ns to -294ns) because the expensive O(n²) pattern matching now runs less often. The net effect is positive; a rough cost model follows below.
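A rough amortized view of the trade-off (symbols are illustrative, not measured values): if each draft call pays a fixed throttle overhead $c$ and the $O(n^2)$ pattern match costs $M$ but runs only every $k$-th call, then

$$\text{average cost per call} \approx c + \frac{M}{k},$$

so a small increase in $c$ (the +61ns throughput regression) buys a $1/k$ reduction in the dominant $M$ term, which is what the improved response time reflects.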

std::match_results::_M_establish_failed_match (llama-tts): throughput time improved by 53% (-81ns) and response time by 4% (-81ns). These gains come from adoption of the std::regex::optimize flag in the codebase, which improves the efficiency of the compiled regex automaton.
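For reference, std::regex::optimize is passed at construction time; a minimal usage sketch (the pattern here is illustrative, not taken from llama.cpp):

```cpp
#include <regex>
#include <string>

int main() {
    // std::regex::optimize trades slower regex construction for faster
    // matching by letting the implementation build a more efficient automaton.
    const std::regex re("<\\|[a-z_]+\\|>",
                        std::regex::ECMAScript | std::regex::optimize);

    const std::string s = "<|im_start|>";
    return std::regex_match(s, re) ? 0 : 1;  // 0 if the token matches
}
```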

Standard library functions (various): Mixed compiler optimization effects. Improvements include std::_Rb_tree::end() (-75% throughput), json_sax_dom_callback_parser::null() (-75% throughput), std::_Function_handler::_M_invoke (-46% throughput). Regressions include std::vector::end() (+307% throughput, +183ns), nlohmann::json::get() (+307% throughput, +183ns), std::make_shared (+207% throughput, +130ns). All changes are in non-critical paths (initialization, text preprocessing, configuration loading) with absolute impacts under 200ns.

Additional Findings

Zero impact on performance-critical inference paths: matrix operations (70-90% of inference time), attention mechanisms, KV cache, quantization kernels, and all GPU backends remain unchanged. Changes isolated to CPU-side speculative decoding utilities in common library. No GPU/ML operations affected.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev force-pushed the main branch 19 times, most recently from cd152fa to ab12294 on February 3, 2026 11:18
loci-dev force-pushed the main branch 2 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 13:32
loci-dev force-pushed the main branch 9 times, most recently from 073bd79 to 823244c on February 18, 2026 02:17