spec : add self speculative decoding, ngram and refactor #1261
Conversation
common : use common_ prefix for common library function
llama : use LLAMA_TOKEN_NULL
spec : add self speculative decoding (no draft model required) + refactor
spec : add ngram-mod
spec : various improvements to ngram-map + docs
spec : fix the check-rate logic of ngram-simple
common : add common_speculative_is_compat()
spec : simplify time measurement using common_time_meas
refactor common_sampler_init
refactor common_token_to_piece
refactor and fix cur_p bug
clean up
Can we have this in the server? Otherwise it's hard to test for me. I'm on a phone, away from home; I can only access my PC via ssh. error: unknown argument: --spec-type
I don't have this issue.
Yeah, sorry, I built the wrong thing. Would it be possible to make it work with VL models too?
I was able to build this branch. I'm not sure of the best way to tweak it: fwiw it still seems to be working okay with
It's best to create a feature request in mainline. This is beyond my ability. |
Did some testing with GLM-4.7.
Baseline without speculative decoding:
With self speculative decoding (
With GLM-4.5 DRAFT model from jukofyork (
It probably depends a lot on the task and how much repeated output there is - the new method requires that large chunks of the preceding context get repeated to work well. Also linking this post I just saw in case anyone is using it with DOS/Windows. Has anyone tested it with a model using MLA? I couldn't get much of a boost in
I see. I am actually quite happy with the 13% TG boost already 😄
I'm very happy with this, seeing a 10-30% tg boost at high context (45k+). I expected it to be good for programming only, but as it turns out, with enough context it's perfectly fine for QnA and conversations. I use this: --draft-min 1 --spec-ngram-size-n 8. For my use case (qwen3vl 235B, high RAM offload) it largely negates the tg loss that comes with long context. Anyway: this makes it sound like it's how far back in the context window the program is allowed to search (e.g. you have 40k context but only want to search the last 20k), but I imagine this is just the minimum sequence length that's required to match for the speculation to start; i.e. if the last spec-ngram-size-n generated tokens happen to match an already generated sequence in the past, speculate that the next draft-max tokens are going to be the same as the continuation of that sequence.
I advise against --spec-type ngram-mod. While it's the one that sounds the best, it's also the one that gave me the worst results. Also, it seems that self speculative decoding works better the more conversational context you have, as it has more examples of past sequences to work with. There it looked like you had only 5k context, and especially if it's not past messages but rather a system prompt, then results are bound to be poor.
Did you use -sm layer or -sm graph for the GLM 4.7 main model?
Looks like it dumps when the context shrinks; otherwise about a 20 percent increase (I normally get 16 tps).
./build/bin/llama-server
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="128799779225600" timestamp=1770858031 id_slot=0 id_task=3251 p0=38153
There's a bug:
Happens every time there's a character swap, maybe whenever context is discarded? Also, this will happen with --spec-type ngram-map-k, but not with --spec-type ngram-simple.
Neither. This is a Strix Halo + 1 RTX 3090 eGPU setup, and I only use the CPU of the Strix Halo. It can fit a 3.2 bpw quant of GLM 4.7.
Something is wrong with mainline's ngram-map-k implementation, as I had the same crash testing mainline, but interestingly, ngram-map-k4v works fine.
Is it normal, for the same seed, to have each generation be different? This breaks reproducibility for me. Am I doing something wrong?
Model: GLM 4.7 IQ4_k from ubergarm.
Relevant params:
If I disable (not use) spec-decoding I get the same results (for the same seed) each generation.
Closes #1197
This PR ports self speculative decoding and ngram-related PRs from mainline, up to ggml-org/llama.cpp#19377.
Details are mostly in: ggml-org/llama.cpp#18471 and ggml-org/llama.cpp#19164
Add a new sampler type called DIST, used in speculative decoding to get the probability of logits after sampling. Mainline returns the probability after any samplers have run, but that's not the case here. DIST is only used for the draft model.
The other changes are just refactoring existing functions to make future ports easier.
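For intuition only, here is a minimal C++ sketch of what a DIST-style step conceptually does: draw one token from the already-filtered candidate distribution and return its normalized probability, which is what the speculative verifier needs for the draft model. The names (`token_prob`, `dist_sample`) are hypothetical and do not reflect the actual common/sampling API in this repo.

```cpp
// Conceptual sketch only: not the real sampler structures of this repo.
#include <cstdint>
#include <random>
#include <vector>

struct token_prob {
    int32_t id; // token id
    float   p;  // probability after the preceding samplers, softmax-normalized
};

// Draw one token from a non-empty, normalized candidate list and return both
// the id and its probability; the probability is consumed by the verifier.
static token_prob dist_sample(const std::vector<token_prob> & cur_p, std::mt19937 & rng) {
    std::vector<float> weights;
    weights.reserve(cur_p.size());
    for (const auto & tp : cur_p) {
        weights.push_back(tp.p);
    }
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
    return cur_p[dist(rng)];
}
```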
Since I couldn't find much information about ngram-based speculative decoding, the following information was generated by AI and is for reference only. I will correct it if you find any mistakes.
The parameters for self speculative decoding
--spec-ngram-size-n
What it does: Looks backward from the current position and takes the last N tokens. It then searches the history for any earlier place where the same pattern of N tokens appeared.
Higher values: Look for longer patterns, potentially better matches but slower
Lower values: Faster lookup but might miss longer patterns
--spec-ngram-size-m
What it does: How many tokens to predict/draft ahead once a pattern is found
Higher values: Draft more tokens at once (faster if correct)
Lower values: Draft fewer tokens (more accurate but slower)
--spec-ngram-min-hits
What it does: Minimum number of times a pattern must appear in context before using it for prediction
Higher values: More conservative, only use reliable patterns
Lower values: More aggressive, use patterns even if seen few times
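To make the interplay concrete, here is a minimal C++ sketch, under my own simplifying assumptions (it is not the actual implementation), of how the three parameters drive one draft attempt: take the last `size_n` tokens as the key, count how often that key occurred earlier in the context, and only if it occurred at least `min_hits` times copy the `size_m` tokens that followed the most recent earlier occurrence.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

using llama_token = int32_t; // illustrative typedef for this sketch

// One n-gram draft attempt driven by --spec-ngram-size-n (size_n),
// --spec-ngram-size-m (size_m) and --spec-ngram-min-hits (min_hits).
static std::vector<llama_token> ngram_draft(
        const std::vector<llama_token> & ctx, // prompt + generated tokens so far
        size_t size_n, size_t size_m, size_t min_hits) {
    std::vector<llama_token> draft;
    if (ctx.size() < size_n + 1) {
        return draft;
    }
    const llama_token * key = ctx.data() + ctx.size() - size_n; // last N tokens

    size_t hits     = 0;
    size_t best_end = 0; // position right after the most recent earlier match
    for (size_t i = 0; i + size_n < ctx.size(); ++i) {
        bool match = true;
        for (size_t j = 0; j < size_n; ++j) {
            if (ctx[i + j] != key[j]) { match = false; break; }
        }
        if (match) {
            ++hits;
            best_end = i + size_n;
        }
    }
    if (hits < min_hits) {
        return draft; // pattern not seen often enough: do not speculate
    }
    // copy up to M tokens that followed the matched pattern as the draft
    for (size_t k = best_end; k < ctx.size() && draft.size() < size_m; ++k) {
        draft.push_back(ctx[k]);
    }
    return draft;
}
```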
The main n-gram-based self-speculative decoding strategies (--spec-type)
ngram-simple
Performs a linear (brute-force) scan through the entire current context (prompt + tokens generated so far).
Searches for an exact match of the last N tokens (the lookup key / n-gram).
If a match is found anywhere in the history, it copies the following up to M tokens as the proposed draft sequence.
Uses no additional data structure — just simple pattern matching with no indexing or frequency tracking.
Lightweight in memory, but can become slow on very long contexts due to repeated full scans.
Tends to have lower acceptance rate because it accepts any single occurrence — even rare or unrepresentative ones.
Best for: very long contexts with near-exact repetitions (e.g., repetitive logs, boilerplate text, large repeated documents).
ngram-map-k
Builds an internal hash-map (keyed by n-gram hashes) that indexes all previous n-grams appearing in the current context window only.
Looks up the last N tokens in this per-context map.
If found, selects a continuation (up to M tokens) — usually the most frequent one (or from top-k if multiple exist).
Lookups are more efficient than linear scanning once the map is populated.
Considers frequency or multiple occurrences within the same conversation/context → generally higher-quality drafts than ngram-simple.
The map is reset / rebuilt for each new independent context (no persistence across sessions).
Best for: moderately repetitive but not perfectly identical patterns (e.g., code editing, structured JSON/YAML, chat with recurring phrases/styles).
Usually offers the best balance of hit rate and acceptance for single long generations.
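As a rough illustration of the indexing idea, here is a hedged C++ sketch (hypothetical names, a simplified value type that only stores continuation positions, FNV-1a hashing chosen just for the sketch; not the actual implementation): the context is indexed once into a hash map, so each lookup is a single map query instead of a full scan.

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>
#include <vector>

using llama_token = int32_t; // illustrative typedef for this sketch

// Illustrative per-context n-gram index: key = hash of N consecutive tokens,
// value = positions right after each occurrence of that n-gram.
struct ngram_index {
    size_t size_n;
    std::unordered_map<uint64_t, std::vector<size_t>> pos; // hash -> continuation positions

    static uint64_t hash_ngram(const llama_token * t, size_t n) {
        uint64_t h = 1469598103934665603ull;  // FNV-1a offset basis
        for (size_t i = 0; i < n; ++i) {
            h ^= (uint64_t) (uint32_t) t[i];
            h *= 1099511628211ull;            // FNV-1a prime
        }
        return h;
    }

    // rebuilt for each independent context (no persistence across sessions)
    void build(const std::vector<llama_token> & ctx) {
        pos.clear();
        for (size_t i = 0; i + size_n < ctx.size(); ++i) {
            pos[hash_ngram(ctx.data() + i, size_n)].push_back(i + size_n);
        }
    }

    // O(1) lookup of the last N tokens; copy up to size_m tokens after the most
    // recent earlier occurrence (a real top-k variant would rank the candidates)
    std::vector<llama_token> draft(const std::vector<llama_token> & ctx, size_t size_m) const {
        std::vector<llama_token> out;
        if (ctx.size() < size_n) return out;
        auto it = pos.find(hash_ngram(ctx.data() + ctx.size() - size_n, size_n));
        if (it == pos.end() || it->second.empty()) return out;
        for (size_t k = it->second.back(); k < ctx.size() && out.size() < size_m; ++k) {
            out.push_back(ctx[k]);
        }
        return out;
    }
};
```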
ngram-map-k4v (experimental variant of map-k)
Similar to ngram-map-k, but stores up to four different continuations (m-grams / values) for each n-gram key in the hash-map.
Looks for the current n-gram of size N (the key) in the token history.
When a match occurs, it drafts the most frequent continuation among the stored options (or picks intelligently from the up-to-4 values).
Allows better handling of ambiguous / branching patterns where the same prefix has appeared before but led to multiple reasonable next sequences.
Still per-context (the map resets per session / generation like ngram-map-k), but tracks richer statistics per key.
Often more accurate / higher acceptance in workloads with longer or varied repetitions (e.g., code with similar blocks, JSON schemas with optional fields, logs with slight variations).
Best for: cases where ngram-map-k feels too conservative or misses good continuations due to single-value storage; try this when you see many partial matches but want longer accepted drafts.
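A hedged sketch of the "up to four values per key" idea (names and layout are my own guesses, not the actual data structure): each key keeps a tiny fixed-size slate of candidate continuations, here identified only by their start position in the context to keep the sketch small, together with counters, and the draft uses the most frequently observed one.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative value type in the spirit of ngram-map-k4v: each n-gram key keeps
// up to 4 candidate continuations with observation counts. A real implementation
// would compare the stored m-grams themselves, not just start positions.
struct k4v_entry {
    static const int MAX_VALUES = 4;

    size_t   pos[MAX_VALUES]   = {0, 0, 0, 0}; // continuation start positions
    uint32_t count[MAX_VALUES] = {0, 0, 0, 0}; // observation counts
    int      n_values          = 0;

    // record one more observation of a continuation starting at position p
    void add(size_t p) {
        for (int i = 0; i < n_values; ++i) {
            if (pos[i] == p) { count[i]++; return; }
        }
        if (n_values < MAX_VALUES) {
            pos[n_values]   = p;
            count[n_values] = 1;
            n_values++;
        }
        // when full, a real implementation might evict the weakest slot; here we ignore
    }

    // pick the most frequently observed continuation for drafting
    bool best(size_t & out_pos) const {
        uint32_t best_cnt = 0;
        for (int i = 0; i < n_values; ++i) {
            if (count[i] > best_cnt) { best_cnt = count[i]; out_pos = pos[i]; }
        }
        return best_cnt > 0;
    }
};
```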
ngram-mod
Uses a persistent shared hash-pool (global map) that lives across all server slots / sessions in llama-server (not reset per request).
The map stores n-gram hash → single next token (Markov-like, not full m-gram continuations).
During speculation, it iteratively computes a rolling hash of recent tokens and chains single-token predictions to build a variable-length draft.
Draft length is not fixed to M — it depends on how strong/long the chain is in the persistent pool.
Most conservative / highest-quality per token because predictions are aggregated from real usage across many generations (improves with warmup).
Memory footprint is small and roughly constant (~16 MB in reported implementations).
Best for: multi-turn chats, agents, repeated coding styles, long sessions, or llama-server with many similar users / requests; shines after the pool has warmed up.
May start weaker (fewer drafts proposed) until enough data accumulates in the pool.
Single-token chaining can sometimes limit very long exact-sequence matches compared to full m-gram copy approaches.
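A minimal sketch of the chaining idea, again with assumed names and a much-simplified hash (recomputed here rather than rolled incrementally); it is not the actual code. The pool is a single map from hash(recent tokens) to one predicted next token, queried repeatedly to grow a variable-length draft.

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>
#include <vector>

using llama_token = int32_t; // illustrative typedef for this sketch

// Illustrative persistent pool in the spirit of ngram-mod: one shared map that
// outlives individual requests and stores hash(recent tokens) -> next token.
struct ngram_pool {
    size_t size_n; // how many recent tokens feed the hash
    std::unordered_map<uint64_t, llama_token> next; // persists across requests

    static uint64_t hash_window(const std::vector<llama_token> & t, size_t end, size_t n) {
        uint64_t h = 1469598103934665603ull; // FNV-1a
        for (size_t i = end - n; i < end; ++i) {
            h ^= (uint64_t) (uint32_t) t[i];
            h *= 1099511628211ull;
        }
        return h;
    }

    // called as tokens are accepted: the pool keeps learning across generations
    void update(const std::vector<llama_token> & ctx) {
        for (size_t end = size_n; end < ctx.size(); ++end) {
            next[hash_window(ctx, end, size_n)] = ctx[end];
        }
    }

    // chain single-token predictions into a variable-length draft; the chain
    // stops as soon as the pool has no entry, so draft length is not fixed to M
    std::vector<llama_token> draft(std::vector<llama_token> tail, size_t max_len) const {
        std::vector<llama_token> out;
        while (out.size() < max_len && tail.size() >= size_n) {
            auto it = next.find(hash_window(tail, tail.size(), size_n));
            if (it == next.end()) break;
            out.push_back(it->second);
            tail.push_back(it->second); // feed the prediction back into the window
        }
        return out;
    }
};
```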
llama-server [...] --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
Realistic recommendation
ngram-map-k4v
Most people's current favorite for one-shot / single long generations
→ especially good on code, JSON, markdown, logs, templates with some variation
ngram-mod
Frequently becomes the strongest option when running llama-server for:
multi-turn conversations
agents / tool-calling loops
repeated coding / writing style
shared server with similar users
(needs some generations to warm up the pool)
ngram-map-k
Solid safe choice when k4v is not available or feels too aggressive
ngram-simple
Usually only used when the map-based variants are surprisingly slow (rare)
or when context has extremely precise, long identical repeats
Acceptance rate quick check (printed in logs):
excellent: try longer --spec-ngram-size-m / --draft-max
good: balanced regime
low: reduce aggression or switch to --spec-type none