
UPSTREAM PR #19164: spec : add ngram-mod #1063

Open
loci-dev wants to merge 5 commits into main from upstream-PR19164-branch_ggml-org-gg/spec-ngram-mod

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19164

cont #18471

Add basic ngram hasher for speculative decoding:

  • For each ngram, compute a hash using a linear congruential generator (LCG)
  • For each computed hash, store the next token
  • During speculation, iteratively compute the rolling hash of the last n tokens and pick the next token from the storage
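The three steps above can be sketched roughly as follows. This is a hedged Python illustration, not the llama.cpp implementation: the function names and LCG constants are assumptions, and for clarity the hash of the last n tokens is recomputed each step rather than updated as a true rolling hash.

```python
# Illustrative constants (Knuth's MMIX LCG parameters); the PR's actual
# constants may differ.
MULT = 6364136223846793005
INC = 1442695040888963407
MASK = (1 << 64) - 1

def ngram_hash(tokens):
    """Fold a token sequence into a 64-bit hash with one LCG step per token."""
    h = 0
    for t in tokens:
        h = ((h ^ t) * MULT + INC) & MASK
    return h

def build_pool(tokens, n):
    """Map hash(n-gram) -> the token that followed that n-gram in the history."""
    pool = {}
    for i in range(len(tokens) - n):
        pool[ngram_hash(tokens[i:i + n])] = tokens[i + n]
    return pool

def draft(context, pool, n, draft_max):
    """Speculate by repeatedly hashing the last n tokens and looking up
    the stored continuation; stops early, so draft length is variable."""
    out = []
    ctx = list(context)
    for _ in range(draft_max):
        h = ngram_hash(ctx[-n:])
        if h not in pool:
            break  # no stored continuation: stop drafting early
        t = pool[h]
        out.append(t)
        ctx.append(t)
    return out
```

On repetitive input (the "iterating over a block of text" use case below), the lookup keeps hitting and the draft extends up to `draft_max`; on novel input it stops immediately, which is what makes the draft length variable.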

Some characteristics:

  • Lightweight (~20 MB)
  • Constant memory and complexity
  • Can generate variable draft lengths (i.e. m is not fixed)
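The PR does not show the pool's data structure, but the constant-memory property suggests something like a fixed-size direct-mapped table, where each hash indexes one slot and collisions simply overwrite. A hypothetical sketch under that assumption (class name and sizes are illustrative; e.g. ~2M slots at 12 bytes each would land near the stated ~20 MB):

```python
class FixedPool:
    """Direct-mapped hash -> next-token store with constant memory.
    Collisions overwrite the slot, so memory never grows."""

    def __init__(self, size=1 << 21):  # illustrative slot count
        self.size = size
        self.keys = [0] * size   # stored full hashes (0 = empty)
        self.vals = [0] * size   # stored next tokens

    def put(self, h, token):
        i = h % self.size
        self.keys[i], self.vals[i] = h, token  # overwrite on collision

    def get(self, h):
        i = h % self.size
        return self.vals[i] if self.keys[i] == h else None
```

Storing the full hash alongside the token lets a lookup reject a slot that was overwritten by a colliding n-gram, at the cost of silently forgetting the older entry.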

Currently, a single hash pool is shared across all server slots, so different requests can benefit from each other.

Sample usage:

# notes:
# - small `n` is not recommended
# - MoE models require long drafts
# - dense models can use lower `--draft-min` and `--draft-max`

llama-server ... --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 16 --draft-max 64

Applications:

  • Iterating over a block of text
  • Summarization

Example:

spec-mod-0.mov

TODO:

  • Reset criteria?

@loci-review

loci-review bot commented Jan 28, 2026

Performance Review Report: llama.cpp Speculative Decoding Enhancement

Executive Summary

Commit 7ef5b95 ("spec: add ngram-mod") introduces n-gram based speculative decoding with negligible performance impact on existing operations. Analysis of 11 function instances across llama-tts and llama-cvector-generator binaries reveals performance changes isolated to non-critical utility code, with core inference operations completely unchanged.

Impact Classification: Minor

Files Changed: 12 modified, 39 added, 3 deleted

Performance Changes:

  • Core inference: 0 nanoseconds change (llama_decode, GEMM, attention, KV cache unchanged)
  • Initialization overhead: +2,000-5,000 nanoseconds total (one-time startup cost)
  • Template rendering: -183 nanoseconds per call (69.6% improvement in jinja function dispatch)
  • HTTP operations: +179 nanoseconds per call, but +256% throughput improvement

Key Findings

No Critical Path Impact: All performance-critical components (matrix multiplication, attention mechanisms, quantization kernels, GPU backends) remain unmodified. The new n-gram speculative decoding is opt-in and does not execute in default inference paths.

Standard Library Regressions: Observed regressions occur exclusively in STL functions (vector iterators, allocators, tree accessors) with no source code changes. Performance variations stem from compiler optimization differences between builds, not algorithmic modifications:

  • std::vector::begin(): +181 nanoseconds (initialization only)
  • std::allocator::deallocate(): +33 nanoseconds (initialization only)
  • std::chrono::__cast(): +190 nanoseconds (logging only)

Positive Changes: Template function lookup improved by 183 nanoseconds per call, providing 4-37 microseconds savings per template render. HTTP socket validation shows 256% throughput improvement, indicating better connection pooling.

Energy Impact: Unable to quantify due to tool error, but expected to be unmeasurable given unchanged compute operations and nanosecond-scale differences in non-critical paths.

Conclusion

The commit successfully adds valuable speculative decoding functionality without impacting inference performance. All overhead occurs in initialization code (microseconds at startup) or shows net improvements (template rendering, HTTP operations). No optimization required.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev loci-dev force-pushed the main branch 15 times, most recently from 8a94eed to adaff32 Compare January 30, 2026 07:22
@loci-dev loci-dev force-pushed the upstream-PR19164-branch_ggml-org-gg/spec-ngram-mod branch from 7ef5b95 to 1644da7 Compare January 30, 2026 13:48
@loci-review

loci-review bot commented Jan 30, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 3bf1adc to a316554 Compare January 30, 2026 18:18
@loci-dev loci-dev force-pushed the main branch 16 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 9ea4a65 to c001e9f Compare February 22, 2026 02:17