Conversation
Performance Review Report: llama.cpp Speculative Decoding Enhancement

Executive Summary

Commit 7ef5b95 ("spec: add ngram-mod") introduces n-gram based speculative decoding with negligible performance impact on existing operations. Analysis of 11 function instances across the llama-tts and llama-cvector-generator binaries reveals performance changes isolated to non-critical utility code, with core inference operations completely unchanged.

Impact Classification: Minor
Files Changed: 12 modified, 39 added, 3 deleted
Performance Changes:
Key Findings

No Critical Path Impact: All performance-critical components (matrix multiplication, attention mechanisms, quantization kernels, GPU backends) remain unmodified. The new n-gram speculative decoding is opt-in and does not execute in default inference paths.

Standard Library Regressions: Observed regressions occur exclusively in STL functions (vector iterators, allocators, tree accessors) with no source code changes. Performance variations stem from compiler optimization differences between builds, not algorithmic modifications:
Positive Changes: Template function lookup improved by 183 nanoseconds per call, providing 4-37 microseconds of savings per template render. HTTP socket validation shows a 256% throughput improvement, indicating better connection pooling.

Energy Impact: Unable to quantify due to a tool error, but expected to be unmeasurable given the unchanged compute operations and the nanosecond-scale differences in non-critical paths.

Conclusion

The commit successfully adds valuable speculative decoding functionality without impacting inference performance. All overhead occurs in initialization code (microseconds at startup) or shows net improvements (template rendering, HTTP operations). No optimization required. See the complete breakdown in Version Insights.
Mirrored from ggml-org/llama.cpp#19164
cont #18471
Add a basic n-gram hasher for speculative decoding: hash the last `n` tokens and pick the next token from the storage.

Some characteristics:

- … (`m` is not fixed)
- Currently, a single hash pool is shared across all server slots, so different requests can benefit from each other.
Sample usage:
Applications:
Example:
spec-mod-0.mov
TODO: