UPSTREAM PR #19253: Feature/ngram map reasoning #1124
Conversation
Overview

Analysis of 115,474 functions across 15 binaries identified 126 modified, 147 new, and 49 removed functions. Changes focus on n-gram-based speculative decoding enhancements with minimal performance impact.

Power Consumption Changes:
Function Analysis

- common_speculative_state_ngram_map_k::begin() (llama-tts, llama-cvector-generator): Response time increased from 7ns to ~98,000ns (+1,388,899%), throughput time from 7ns to 16ns (+133%). This is a correctness fix: the base version was a no-op.
- common_ngram_map constructor (both binaries): Response time increased from 225ns to ~2,424ns (+977%), throughput time from 137ns to 176ns (+29%). Added hash map allocation (262,144 entries, ~1MB); a sketch of this pre-allocation idea follows after this comment.
- common_speculative_begin() (both binaries): Response time increased from ~488ns to ~879ns (+80%), throughput time from 151ns to 225ns (+48%). Added microsecond-precision timing instrumentation.
- Standard library functions (std::_Rb_tree::end(), std::vector::begin/end(), std::_Hashtable::begin()) show 180-187ns increases (+214-310%) with no source code changes; these are compiler artifacts affecting non-critical initialization paths.
- Other analyzed functions showed negligible changes.

Additional Findings

All GPU backend libraries and core inference operations (GEMM, attention, quantization) remain unchanged. The modifications affect initialization only, not the per-token inference hot path. Changes enable a 2-3x inference speedup through speculative decoding while introducing a <0.4% power consumption increase. Five commits by Sascha Rogmann delivered correctness fixes, performance instrumentation, and hash map optimizations with well-justified trade-offs.

🔎 Full breakdown: Loci Inspector.
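The pre-allocation trade-off described above (a one-time ~1 MB allocation in exchange for O(1) average-case n-gram lookups during drafting) can be illustrated with a minimal C++ sketch. The type and member names below are hypothetical and simplified for illustration; this is not the actual common_ngram_map from the PR.

```cpp
// Minimal sketch (not the actual common_ngram_map): an n-gram key -> continuation-token
// map that pre-allocates its bucket array once at construction, so per-token lookups
// during drafting stay O(1) on average instead of O(n) linear scans.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

struct ngram_key {
    // hypothetical: pack a short window of token ids into a single 64-bit key
    uint64_t packed;
    bool operator==(const ngram_key & other) const { return packed == other.packed; }
};

struct ngram_key_hash {
    std::size_t operator()(const ngram_key & k) const { return std::hash<uint64_t>{}(k.packed); }
};

class ngram_map_sketch {
public:
    // One-time construction cost: reserve capacity for ~262,144 entries (roughly the
    // ~1 MB pre-allocation described above), paid once per map, not per token.
    ngram_map_sketch() { map_.reserve(262144); }

    void insert(ngram_key key, int32_t next_token) { map_[key] = next_token; }

    // Average O(1) lookup; a linear scan over a flat list of (key, token) pairs
    // would instead cost O(n) for every drafted token.
    bool lookup(ngram_key key, int32_t & next_token) const {
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        next_token = it->second;
        return true;
    }

private:
    std::unordered_map<ngram_key, int32_t, ngram_key_hash> map_;
};
```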
Overview

Analysis of 115,474 functions across 14 binaries reveals minor positive impact from speculative decoding enhancements. Changes include 126 modified, 147 new, and 49 removed functions, with 115,152 unchanged.

Power Consumption Changes:
Core inference libraries show zero change, confirming the modifications are isolated to speculative decoding orchestration.

Function Analysis

- common_speculative_state_ngram_map_k::begin() (llama-tts, llama-cvector-generator): Response time increased from 7ns to ~98,000ns (+1,388,889%). This represents a critical correctness fix, not a regression: the base version was a non-functional no-op causing cache pollution, while the target version properly maintains the n-gram cache. Called once per generation sequence, the ~98μs overhead is negligible when amortized across tokens.
- common_ngram_map constructor (both binaries): Response time increased from 225ns to 2,424ns (+976%). The change pre-allocates a 262,144-entry hash table (~1MB) for O(1) n-gram lookups instead of O(n) linear search. The one-time 2.2μs initialization cost enables significant runtime performance gains.
- common_speculative_begin() (both binaries): Response time increased from 488ns to 879ns (+80%). Added microsecond-precision timing instrumentation for performance profiling (see the timing sketch after this comment). The 391ns overhead enables critical observability for optimizing speculative decoding phases.
- Multiple STL functions (std::vector::begin/end, std::unordered_set::begin, std::map::end) show 180-190ns increases (+200-300%) despite no source changes, indicating compiler optimization differences. These are in non-critical utility paths with negligible real-world impact.

Additional Findings

All GPU backends, matrix multiplication kernels, attention mechanisms, and quantization routines remain unchanged. The modifications optimize CPU-side speculative decoding orchestration without affecting GPU compute operations. The O(1) hash-based n-gram lookup optimization and correctness fixes justify the minimal initialization overhead (<0.01% of inference time).

🔎 Full breakdown: Loci Inspector.
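As a rough illustration of the timing instrumentation described above, here is a minimal C++ sketch that wraps a hypothetical per-sequence "begin" phase with microsecond-precision clock reads. The names (time_us, spec_timing, speculative_begin_sketch) are assumptions for illustration only, not llama.cpp's actual API; the point is that the few hundred nanoseconds spent reading the clock are paid once per sequence, not per token.

```cpp
// Minimal sketch (assumed names): microsecond-precision timing around a
// speculative-decoding "begin" phase, in the spirit of the instrumentation
// described in the analysis above.
#include <chrono>
#include <cstdint>
#include <cstdio>

static int64_t time_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

struct spec_timing {
    int64_t t_begin_us = 0;  // accumulated time spent in the begin phase
    int64_t n_begin    = 0;  // number of times the phase ran
};

// Hypothetical begin phase: reset per-sequence draft state and record its duration.
void speculative_begin_sketch(spec_timing & timing) {
    const int64_t t0 = time_us();

    // ... reset n-gram cache / per-sequence draft state here ...

    timing.t_begin_us += time_us() - t0;
    timing.n_begin    += 1;
}

int main() {
    spec_timing timing;
    speculative_begin_sketch(timing);
    std::printf("begin: %lld us over %lld call(s)\n",
                (long long) timing.t_begin_us, (long long) timing.n_begin);
    return 0;
}
```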
76ecbd8 to f91deaa
048ad94 to 6c1fde6
073bd79 to 823244c
Note
Source pull request: ggml-org/llama.cpp#19253
This pull request is a follow-up to #18471.