Skip to content

UPSTREAM PR #19253: Feature/ngram map reasoning#1124

Open
loci-dev wants to merge 6 commits intomainfrom
loci/pr-19253-feature-ngram-map-reasoning
Open

UPSTREAM PR #19253: Feature/ngram map reasoning#1124
loci-dev wants to merge 6 commits intomainfrom
loci/pr-19253-feature-ngram-map-reasoning

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Feb 1, 2026

Note

Source pull request: ggml-org/llama.cpp#19253

This pull request is a follow-up to #18471.

  • ngram-map: Remove outdated entries if reasoning of previous message has been removed.
  • ngram-map: An internal hash map which was missing in #18471 has been added.
  • statistics: Computation of t_begin and t_accept has been added.
  • docs/speculative.md: ngram-mod has been added (see #19164).

@loci-review
Copy link

loci-review bot commented Feb 1, 2026

Overview

Analysis of 115,474 functions across 15 binaries identified 126 modified, 147 new, and 49 removed functions. Changes focus on n-gram-based speculative decoding enhancements with minimal performance impact.

Power Consumption Changes:

  • build.bin.llama-tts: +0.246% (+885 nJ)
  • build.bin.llama-cvector-generator: +0.355% (+1,258 nJ)
  • All other binaries (build.bin.libmtmd.so, build.bin.libllama.so, build.bin.llama-tokenize, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-bench, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so): 0.0% change

Function Analysis

common_speculative_state_ngram_map_k::begin() (llama-tts, llama-cvector-generator): Response time increased from 7ns to ~98,000ns (+1,388,899%), throughput time from 7ns to 16ns (+133%). This is a correctness fix—the base version was a no-op (GGML_UNUSED(prompt)), causing stale n-gram data contamination. The target version properly calls common_ngram_map_begin() to initialize maps and clean outdated entries. The 98µs overhead occurs once per generation sequence, not per token.

common_ngram_map constructor (both binaries): Response time increased from 225ns to ~2,424ns (+977%), throughput time from 137ns to 176ns (+29%). Added hash map allocation (262,144 entries, ~1MB) via key_map.resize() to enable O(1) lookups instead of O(n) linear search. One-time initialization cost enables faster runtime performance.

common_speculative_begin() (both binaries): Response time increased from ~488ns to ~879ns (+80%), throughput time from 151ns to 225ns (+48%). Added microsecond-precision timing instrumentation (ggml_time_us()) for profiling initialization, draft generation, and acceptance phases. Provides essential observability for optimizing speculative decoding.

Standard library functions (std::_Rb_tree::end(), std::vector::begin/end(), std::_Hashtable::begin()) show 180-187ns increases (+214-310%) with no source code changes—compiler artifacts affecting non-critical initialization paths. Other analyzed functions showed negligible changes.

Additional Findings

All GPU backend libraries and core inference operations (GEMM, attention, quantization) remain unchanged. The modifications affect initialization only, not the per-token inference hot path. Changes enable 2-3x inference speedup through speculative decoding while introducing <0.4% power consumption increase. Five commits by Sascha Rogmann delivered correctness fixes, performance instrumentation, and hash map optimizations with well-justified trade-offs.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-review
Copy link

loci-review bot commented Feb 1, 2026

Overview

Analysis of 115,474 functions across 14 binaries reveals minor positive impact from speculative decoding enhancements. Changes include 126 modified, 147 new, and 49 removed functions, with 115,152 unchanged.

Power Consumption Changes:

  • build.bin.llama-tts: +0.246% (+884 nJ)
  • build.bin.llama-cvector-generator: +0.355% (+1,259 nJ)
  • All other binaries (libllama.so, libggml.so, libggml-base.so, libggml-cpu.so, libmtmd.so, llama-bench, llama-quantize, llama-tokenize, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-gemma3-cli, llama-qwen2vl-cli): 0% change

Core inference libraries show zero change, confirming modifications are isolated to speculative decoding orchestration.

Function Analysis

common_speculative_state_ngram_map_k::begin() (llama-tts, llama-cvector-generator): Response time increased from 7ns to ~98,000ns (+1,388,889%). This represents a critical correctness fix, not regression—the base version was a non-functional no-op causing cache pollution. Target version properly maintains n-gram cache. Called once per generation sequence, the ~98μs overhead is negligible when amortized across tokens.

common_ngram_map constructor (both binaries): Response time increased from 225ns to 2,424ns (+976%). Change pre-allocates 262,144-entry hash table (~1MB) for O(1) n-gram lookups instead of O(n) linear search. One-time 2.2μs initialization cost enables significant runtime performance gains.

common_speculative_begin() (both binaries): Response time increased from 488ns to 879ns (+80%). Added microsecond-precision timing instrumentation for performance profiling. The 391ns overhead enables critical observability for optimizing speculative decoding phases.

Multiple STL functions (std::vector::begin/end, std::unordered_set::begin, std::map::end) show 180-190ns increases (+200-300%) despite no source changes, indicating compiler optimization differences. These are in non-critical utility paths with negligible real-world impact.

Additional Findings

All GPU backends, matrix multiplication kernels, attention mechanisms, and quantization routines remain unchanged. The modifications optimize CPU-side speculative decoding orchestration without affecting GPU compute operations. The O(1) hash-based n-gram lookup optimization and correctness fixes justify the minimal initialization overhead (<0.01% of inference time).

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 18 times, most recently from 76ecbd8 to f91deaa Compare February 2, 2026 17:20
@loci-dev loci-dev force-pushed the main branch 15 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 073bd79 to 823244c Compare February 18, 2026 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments