
UPSTREAM PR #19164: spec : add ngram-mod #1063

Open
loci-dev wants to merge 5 commits into main from upstream-PR19164-branch_ggml-org-gg/spec-ngram-mod

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19164

cont #18471

Add basic ngram hasher for speculative decoding:

  • For each ngram, compute a hash using a linear congruential generator (LCG)
  • For each computed hash, store the next token
  • During speculation, iteratively compute the rolling hash of the last n tokens and pick the next token from the storage
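The three steps above can be sketched roughly as follows. This is a hedged Python illustration, not the llama.cpp implementation: the function names and LCG constants are assumptions, and for clarity the hash of the last n tokens is recomputed each step rather than updated as a true rolling hash.

```python
# Illustrative constants (Knuth's MMIX LCG parameters); the PR's actual
# constants may differ.
MULT = 6364136223846793005
INC = 1442695040888963407
MASK = (1 << 64) - 1

def ngram_hash(tokens):
    """Fold a token sequence into a 64-bit hash with one LCG step per token."""
    h = 0
    for t in tokens:
        h = ((h ^ t) * MULT + INC) & MASK
    return h

def build_pool(tokens, n):
    """Map hash(n-gram) -> the token that followed that n-gram in the history."""
    pool = {}
    for i in range(len(tokens) - n):
        pool[ngram_hash(tokens[i:i + n])] = tokens[i + n]
    return pool

def draft(context, pool, n, draft_max):
    """Speculate by repeatedly hashing the last n tokens and looking up
    the stored continuation; stops early, so draft length is variable."""
    out = []
    ctx = list(context)
    for _ in range(draft_max):
        h = ngram_hash(ctx[-n:])
        if h not in pool:
            break  # no stored continuation: stop drafting early
        t = pool[h]
        out.append(t)
        ctx.append(t)
    return out
```

On repetitive input (the "iterating over a block of text" use case below), the lookup keeps hitting and the draft extends up to `draft_max`; on novel input it stops immediately, which is what makes the draft length variable.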

Some characteristics:

  • Lightweight (~20 MB)
  • Constant memory and complexity
  • Can generate variable draft lengths (i.e. m is not fixed)
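The PR does not show the pool's data structure, but the constant-memory property suggests something like a fixed-size direct-mapped table, where each hash indexes one slot and collisions simply overwrite. A hypothetical sketch under that assumption (class name and sizes are illustrative; e.g. ~2M slots at 12 bytes each would land near the stated ~20 MB):

```python
class FixedPool:
    """Direct-mapped hash -> next-token store with constant memory.
    Collisions overwrite the slot, so memory never grows."""

    def __init__(self, size=1 << 21):  # illustrative slot count
        self.size = size
        self.keys = [0] * size   # stored full hashes (0 = empty)
        self.vals = [0] * size   # stored next tokens

    def put(self, h, token):
        i = h % self.size
        self.keys[i], self.vals[i] = h, token  # overwrite on collision

    def get(self, h):
        i = h % self.size
        return self.vals[i] if self.keys[i] == h else None
```

Storing the full hash alongside the token lets a lookup reject a slot that was overwritten by a colliding n-gram, at the cost of silently forgetting the older entry.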

Currently, a single hash pool is shared across all server slots, so different requests can benefit from each other.

Sample usage:

# notes:
# - small `n` is not recommended
# - MoE models require long drafts
# - dense models can use lower `--draft-min` and `--draft-max`

llama-server ... --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 16 --draft-max 64

Applications:

  • Iterating over a block of text
  • Summarization

Example:

spec-mod-0.mov

TODO:

  • Reset criteria?

@loci-review

loci-review bot commented Jan 28, 2026

Performance Review Report: llama.cpp Speculative Decoding Enhancement

Executive Summary

Commit 7ef5b95 ("spec: add ngram-mod") introduces n-gram based speculative decoding with negligible performance impact on existing operations. Analysis of 11 function instances across llama-tts and llama-cvector-generator binaries reveals performance changes isolated to non-critical utility code, with core inference operations completely unchanged.

Impact Classification: Minor

Files Changed: 12 modified, 39 added, 3 deleted

Performance Changes:

  • Core inference: 0 nanoseconds change (llama_decode, GEMM, attention, KV cache unchanged)
  • Initialization overhead: +2,000-5,000 nanoseconds total (one-time startup cost)
  • Template rendering: -183 nanoseconds per call (69.6% improvement in jinja function dispatch)
  • HTTP operations: +179 nanoseconds per call, but +256% throughput improvement

Key Findings

No Critical Path Impact: All performance-critical components (matrix multiplication, attention mechanisms, quantization kernels, GPU backends) remain unmodified. The new n-gram speculative decoding is opt-in and does not execute in default inference paths.

Standard Library Regressions: Observed regressions occur exclusively in STL functions (vector iterators, allocators, tree accessors) with no source code changes. Performance variations stem from compiler optimization differences between builds, not algorithmic modifications:

  • std::vector::begin(): +181 nanoseconds (initialization only)
  • std::allocator::deallocate(): +33 nanoseconds (initialization only)
  • std::chrono::__cast(): +190 nanoseconds (logging only)

Positive Changes: Template function lookup improved by 183 nanoseconds per call, providing 4-37 microseconds savings per template render. HTTP socket validation shows 256% throughput improvement, indicating better connection pooling.

Energy Impact: Unable to quantify due to tool error, but expected to be unmeasurable given unchanged compute operations and nanosecond-scale differences in non-critical paths.

Conclusion

The commit successfully adds valuable speculative decoding functionality without impacting inference performance. All overhead occurs in initialization code (microseconds at startup) or shows net improvements (template rendering, HTTP operations). No optimization required.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev loci-dev force-pushed the main branch 15 times, most recently from 8a94eed to adaff32 Compare January 30, 2026 07:22
@loci-dev loci-dev force-pushed the upstream-PR19164-branch_ggml-org-gg/spec-ngram-mod branch from 7ef5b95 to 1644da7 Compare January 30, 2026 13:48
@loci-review

loci-review bot commented Jan 30, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 3bf1adc to a316554 Compare January 30, 2026 18:18
@loci-dev loci-dev force-pushed the main branch 16 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 9ea4a65 to c001e9f Compare February 22, 2026 02:17