UPSTREAM PR #19253: Feature/ngram map reasoning #1124
Conversation
Overview

Analysis of 115,474 functions across 15 binaries identified 126 modified, 147 new, and 49 removed functions. Changes focus on n-gram-based speculative decoding enhancements with minimal performance impact.

Power Consumption Changes:
Function Analysis

- common_speculative_state_ngram_map_k::begin() (llama-tts, llama-cvector-generator): Response time increased from 7ns to ~98,000ns (+1,388,899%), throughput time from 7ns to 16ns (+133%). This is a correctness fix: the base version was a no-op.
- common_ngram_map constructor (both binaries): Response time increased from 225ns to ~2,424ns (+977%), throughput time from 137ns to 176ns (+29%). Added hash map allocation (262,144 entries, ~1MB); a sketch of this pre-allocation idea follows after this comment.
- common_speculative_begin() (both binaries): Response time increased from ~488ns to ~879ns (+80%), throughput time from 151ns to 225ns (+48%). Added microsecond-precision timing instrumentation.
- Standard library functions (std::_Rb_tree::end(), std::vector::begin/end(), std::_Hashtable::begin()) show 180-187ns increases (+214-310%) with no source code changes; these are compiler artifacts affecting non-critical initialization paths.
- Other analyzed functions showed negligible changes.

Additional Findings

All GPU backend libraries and core inference operations (GEMM, attention, quantization) remain unchanged. The modifications affect initialization only, not the per-token inference hot path. Changes enable a 2-3x inference speedup through speculative decoding while introducing a <0.4% power consumption increase. Five commits by Sascha Rogmann delivered correctness fixes, performance instrumentation, and hash map optimizations with well-justified trade-offs.

🔎 Full breakdown: Loci Inspector.
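The pre-allocation trade-off described above (a one-time ~1 MB allocation in exchange for O(1) average-case n-gram lookups during drafting) can be illustrated with a minimal C++ sketch. The type and member names below are hypothetical and simplified for illustration; this is not the actual common_ngram_map from the PR.

```cpp
// Minimal sketch (not the actual common_ngram_map): an n-gram key -> continuation-token
// map that pre-allocates its bucket array once at construction, so per-token lookups
// during drafting stay O(1) on average instead of O(n) linear scans.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

struct ngram_key {
    // hypothetical: pack a short window of token ids into a single 64-bit key
    uint64_t packed;
    bool operator==(const ngram_key & other) const { return packed == other.packed; }
};

struct ngram_key_hash {
    std::size_t operator()(const ngram_key & k) const { return std::hash<uint64_t>{}(k.packed); }
};

class ngram_map_sketch {
public:
    // One-time construction cost: reserve capacity for ~262,144 entries (roughly the
    // ~1 MB pre-allocation described above), paid once per map, not per token.
    ngram_map_sketch() { map_.reserve(262144); }

    void insert(ngram_key key, int32_t next_token) { map_[key] = next_token; }

    // Average O(1) lookup; a linear scan over a flat list of (key, token) pairs
    // would instead cost O(n) for every drafted token.
    bool lookup(ngram_key key, int32_t & next_token) const {
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        next_token = it->second;
        return true;
    }

private:
    std::unordered_map<ngram_key, int32_t, ngram_key_hash> map_;
};
```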
Overview

Analysis of 115,474 functions across 14 binaries reveals minor positive impact from speculative decoding enhancements. Changes include 126 modified, 147 new, and 49 removed functions, with 115,152 unchanged.

Power Consumption Changes:
Core inference libraries show zero change, confirming the modifications are isolated to speculative decoding orchestration.

Function Analysis

- common_speculative_state_ngram_map_k::begin() (llama-tts, llama-cvector-generator): Response time increased from 7ns to ~98,000ns (+1,388,889%). This represents a critical correctness fix, not a regression: the base version was a non-functional no-op causing cache pollution, while the target version properly maintains the n-gram cache. Called once per generation sequence, the ~98μs overhead is negligible when amortized across tokens.
- common_ngram_map constructor (both binaries): Response time increased from 225ns to 2,424ns (+976%). The change pre-allocates a 262,144-entry hash table (~1MB) for O(1) n-gram lookups instead of O(n) linear search. The one-time 2.2μs initialization cost enables significant runtime performance gains.
- common_speculative_begin() (both binaries): Response time increased from 488ns to 879ns (+80%). Added microsecond-precision timing instrumentation for performance profiling (see the timing sketch after this comment). The 391ns overhead enables critical observability for optimizing speculative decoding phases.
- Multiple STL functions (std::vector::begin/end, std::unordered_set::begin, std::map::end) show 180-190ns increases (+200-300%) despite no source changes, indicating compiler optimization differences. These are in non-critical utility paths with negligible real-world impact.

Additional Findings

All GPU backends, matrix multiplication kernels, attention mechanisms, and quantization routines remain unchanged. The modifications optimize CPU-side speculative decoding orchestration without affecting GPU compute operations. The O(1) hash-based n-gram lookup optimization and correctness fixes justify the minimal initialization overhead (<0.01% of inference time).

🔎 Full breakdown: Loci Inspector.
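As a rough illustration of the timing instrumentation described above, here is a minimal C++ sketch that wraps a hypothetical per-sequence "begin" phase with microsecond-precision clock reads. The names (time_us, spec_timing, speculative_begin_sketch) are assumptions for illustration only, not llama.cpp's actual API; the point is that the few hundred nanoseconds spent reading the clock are paid once per sequence, not per token.

```cpp
// Minimal sketch (assumed names): microsecond-precision timing around a
// speculative-decoding "begin" phase, in the spirit of the instrumentation
// described in the analysis above.
#include <chrono>
#include <cstdint>
#include <cstdio>

static int64_t time_us() {
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

struct spec_timing {
    int64_t t_begin_us = 0;  // accumulated time spent in the begin phase
    int64_t n_begin    = 0;  // number of times the phase ran
};

// Hypothetical begin phase: reset per-sequence draft state and record its duration.
void speculative_begin_sketch(spec_timing & timing) {
    const int64_t t0 = time_us();

    // ... reset n-gram cache / per-sequence draft state here ...

    timing.t_begin_us += time_us() - t0;
    timing.n_begin    += 1;
}

int main() {
    spec_timing timing;
    speculative_begin_sketch(timing);
    std::printf("begin: %lld us over %lld call(s)\n",
                (long long) timing.t_begin_us, (long long) timing.n_begin);
    return 0;
}
```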
76ecbd8 to f91deaa
048ad94 to 6c1fde6
073bd79 to 823244c
Note
Source pull request: ggml-org/llama.cpp#19253
This pull request is a follow-up to #18471.