spec : add self speculative decoding, ngram and refactor #1261

Merged
ikawrakow merged 3 commits into main from fcp/spec_self on Feb 13, 2026

Conversation

@firecoperana
Collaborator

firecoperana commented Feb 10, 2026

Closes #1197

This PR ports the self speculative decoding and ngram-related PRs from mainline, up to ggml-org/llama.cpp#19377.
Details are mostly in ggml-org/llama.cpp#18471 and ggml-org/llama.cpp#19164.

Adds a new sampler type called DIST, used in speculative decoding to obtain the token probabilities after sampling. Mainline returns probabilities after all samplers have run, which is not the case here. DIST is only used for the draft model.
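
For intuition only, here is a minimal conceptual sketch of what a distribution-style sampler has to provide: softmax the (already filtered) logits, sample a token, and report the probability assigned to it. The names below are hypothetical and this is not the actual common/sampling API.

```cpp
// Conceptual sketch only -- names are hypothetical, not the real ik_llama.cpp API.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct drafted_token { int id; float p; };  // token plus the probability used for acceptance

drafted_token sample_dist(const std::vector<float>& logits, std::mt19937& rng) {
    // softmax over the (already filtered) logits
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (auto& p : probs) p /= sum;

    // sample proportionally to the probabilities
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    const int id = dist(rng);

    // the caller (the speculative loop) can reject the draft if p is too low
    return { id, probs[id] };
}
```

The point is simply that each drafted token carries the probability it was sampled with, so that checks like --draft-p-min can use it.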

The other changes are refactorings of existing functions to make future ports easier.

Since I could not find much information about ngram-based speculative decoding, the following description was generated by AI and is for reference only. I will correct it if you find any mistakes.

The parameters for self speculative decoding (a minimal matching sketch follows the list):

  1. --spec-ngram-size-n
    What it does: Looks backward from the current position and takes the last N tokens. It then searches the history for any earlier place where the same pattern of N tokens appeared.
    Higher values: Look for longer patterns, potentially better matches but slower
    Lower values: Faster lookup but might miss longer patterns

  2. --spec-ngram-size-m
    What it does: How many tokens to predict/draft ahead once a pattern is found
    Higher values: Draft more tokens at once (faster if correct)
    Lower values: Draft fewer tokens (more accurate but slower)

  3. --spec-ngram-min-hits
    What it does: Minimum number of times a pattern must appear in context before using it for prediction
    Higher values: More conservative, only use reliable patterns
    Lower values: More aggressive, use patterns even if seen few times
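
To make the roles of N, M, and min-hits concrete, here is a minimal sketch of a generic n-gram drafting step over the token history. It is an illustration assuming plain exact matching, not the actual implementation; the function name is made up.

```cpp
// Sketch of how --spec-ngram-size-n (n), --spec-ngram-size-m (m) and
// --spec-ngram-min-hits (min_hits) interact; not the actual ik_llama.cpp code.
#include <algorithm>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

std::vector<llama_token> draft_ngram(const std::vector<llama_token>& history,
                                     size_t n, size_t m, size_t min_hits) {
    std::vector<llama_token> draft;
    if (history.size() < n + 1) return draft;

    const auto key_begin = history.end() - n;          // the last N tokens (the lookup key)
    size_t hits     = 0;
    size_t best_pos = 0;                               // position right after the latest match

    for (size_t i = 0; i + n < history.size(); ++i) {  // search every earlier position
        if (std::equal(key_begin, history.end(), history.begin() + i)) {
            ++hits;
            best_pos = i + n;                          // the continuation starts here
        }
    }
    if (hits < min_hits) return draft;                 // pattern not seen often enough

    for (size_t j = best_pos; j < history.size() && draft.size() < m; ++j) {
        draft.push_back(history[j]);                   // copy up to M following tokens
    }
    return draft;
}
```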

The main n-gram-based self-speculative decoding strategies (--spec-type)

ngram-simple

Performs a linear (brute-force) scan through the entire current context (prompt + tokens generated so far).
Searches for an exact match of the last N tokens (the lookup key / n-gram).
If a match is found anywhere in the history, it copies up to M of the tokens that follow it as the proposed draft sequence.
Uses no additional data structure — just simple pattern matching with no indexing or frequency tracking.
Lightweight in memory, but can become slow on very long contexts due to repeated full scans.
Tends to have lower acceptance rate because it accepts any single occurrence — even rare or unrepresentative ones.
Best for: very long contexts with near-exact repetitions (e.g., repetitive logs, boilerplate text, large repeated documents).

ngram-map-k

Builds an internal hash-map (keyed by n-gram hashes) that indexes all previous n-grams appearing in the current context window only.
Looks up the last N tokens in this per-context map.
If found, selects a continuation (up to M tokens) — usually the most frequent one (or from top-k if multiple exist).
Lookups are more efficient than linear scanning once the map is populated.
Considers frequency or multiple occurrences within the same conversation/context → generally higher-quality drafts than ngram-simple.
The map is reset / rebuilt for each new independent context (no persistence across sessions).
Best for: moderately repetitive but not perfectly identical patterns (e.g., code editing, structured JSON/YAML, chat with recurring phrases/styles).
Usually offers the best balance of hit rate and acceptance for single long generations.
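
A minimal sketch of the map-based idea, with hypothetical structures (not the real common/ngram-map.cpp code): hash every n-gram of the context, remember a continuation and a hit count per key, and draft from the entry that matches the trailing n-gram.

```cpp
// Sketch of a per-context n-gram map; hypothetical structures, not the real ngram-map.cpp.
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

using llama_token = int32_t;

struct continuation {
    std::vector<llama_token> tokens;  // tokens that followed this n-gram (up to M of them)
    uint32_t hits = 0;                // how often the n-gram key has been seen
};

static uint64_t hash_ngram(const llama_token* t, size_t n) {
    uint64_t h = 1469598103934665603ull;               // FNV-1a over the token ids
    for (size_t i = 0; i < n; ++i) {
        h ^= static_cast<uint64_t>(t[i]);
        h *= 1099511628211ull;
    }
    return h;
}

// Index the current context, then look up the trailing n-gram and return its continuation.
std::vector<llama_token> draft_map_k(const std::vector<llama_token>& ctx, size_t n, size_t m) {
    if (ctx.size() < n + 1) return {};

    std::unordered_map<uint64_t, continuation> map;
    for (size_t i = 0; i + n < ctx.size(); ++i) {      // every n-gram except the trailing one
        continuation& c = map[hash_ngram(ctx.data() + i, n)];
        c.hits++;
        if (c.tokens.empty()) {                        // keep the first continuation seen (simplified)
            const size_t end = std::min(ctx.size(), i + n + m);
            c.tokens.assign(ctx.begin() + i + n, ctx.begin() + end);
        }
    }

    const auto it = map.find(hash_ngram(ctx.data() + (ctx.size() - n), n));
    return it != map.end() ? it->second.tokens : std::vector<llama_token>{};
}
```

A real implementation would also apply the --spec-ngram-min-hits threshold and prefer the most frequent continuation rather than the first one seen.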

ngram-map-k4v (experimental variant of map-k)

Similar to ngram-map-k, but stores up to four different continuations (m-grams / values) for each n-gram key in the hash-map.
Looks for the current n-gram of size N (the key) in the token history.
When a match occurs, it drafts the most frequent continuation among the stored options (or picks intelligently from the up-to-4 values).
Allows better handling of ambiguous / branching patterns where the same prefix has appeared before but led to multiple reasonable next sequences.
Still per-context (map resets per session / generation like ngram-map-k), but tracks richer statistics per key.
Often more accurate / higher acceptance in workloads with longer or varied repetitions (e.g., code with similar blocks, JSON schemas with optional fields, logs with slight variations).
Best for: cases where ngram-map-k feels too conservative or misses good continuations due to single-value storage — try this when you see many partial matches but want longer accepted drafts.
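
The k4v variant can be pictured as widening each map entry to hold several candidate continuations with counts and drafting the most frequent one. A hypothetical value type, not the actual code:

```cpp
// Hypothetical value type for the "up to 4 continuations per key" idea; not the real code.
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

struct k4v_entry {
    struct candidate {
        std::vector<llama_token> tokens; // one observed continuation (m-gram)
        uint32_t hits = 0;
    };
    std::array<candidate, 4> slots{};    // up to four distinct continuations per n-gram key

    // Record one observed continuation: bump its count, fill an empty slot, or evict.
    void observe(const std::vector<llama_token>& cont) {
        for (auto& s : slots) {
            if (s.hits > 0 && s.tokens == cont) { s.hits++; return; }
            if (s.hits == 0) { s.tokens = cont; s.hits = 1; return; }
        }
        auto weakest = std::min_element(slots.begin(), slots.end(),
            [](const candidate& a, const candidate& b) { return a.hits < b.hits; });
        *weakest = candidate{cont, 1};   // evict the least-seen candidate (one possible policy)
    }

    // Draft the most frequently observed continuation, if any.
    const std::vector<llama_token>* best() const {
        auto it = std::max_element(slots.begin(), slots.end(),
            [](const candidate& a, const candidate& b) { return a.hits < b.hits; });
        return it->hits > 0 ? &it->tokens : nullptr;
    }
};
```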

ngram-mod

Uses a persistent shared hash-pool (global map) that lives across all server slots / sessions in llama-server (not reset per request).
The map stores n-gram hash → single next token (Markov-like, not full m-gram continuations).
During speculation, it iteratively computes a rolling hash of recent tokens and chains single-token predictions to build a variable-length draft.
Draft length is not fixed to M — it depends on how strong/long the chain is in the persistent pool.
Most conservative / highest-quality per token because predictions are aggregated from real usage across many generations (improves with warmup).
Memory footprint is small and roughly constant (~16 MB in reported implementations).
Best for: multi-turn chats, agents, repeated coding styles, long sessions, or llama-server with many similar users / requests — shines after the pool has warmed up.
May start weaker (fewer drafts proposed) until enough data accumulates in the pool.
Single-token chaining can sometimes limit very long exact-sequence matches compared to full m-gram copy approaches.

  • Example usage from docs (good for longer repetitions):

llama-server [...] --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
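
The chaining idea behind ngram-mod can be sketched as a persistent hash → next-token table plus an iterative lookup (hypothetical code, not the real implementation): hash the last N tokens, look up a single predicted token, append it, and repeat until the chain breaks or the draft limit is reached.

```cpp
// Sketch of the "hash -> single next token" chaining idea; not the actual ngram-mod code.
#include <cstdint>
#include <unordered_map>
#include <vector>

using llama_token = int32_t;

// Persistent across requests in this sketch (a global pool, as described above).
static std::unordered_map<uint64_t, llama_token> g_pool;

static uint64_t hash_window(const std::vector<llama_token>& toks, size_t n) {
    uint64_t h = 1469598103934665603ull;                 // FNV-1a over the last n tokens
    for (size_t i = toks.size() - n; i < toks.size(); ++i) {
        h ^= static_cast<uint64_t>(toks[i]);
        h *= 1099511628211ull;
    }
    return h;
}

// Update the pool with freshly accepted tokens: each n-token prefix maps to its next token.
void pool_update(const std::vector<llama_token>& ctx, size_t n) {
    for (size_t i = n; i < ctx.size(); ++i) {
        std::vector<llama_token> prefix(ctx.begin() + i - n, ctx.begin() + i);
        g_pool[hash_window(prefix, n)] = ctx[i];         // last writer wins (simplified)
    }
}

// Chain single-token predictions into a variable-length draft.
std::vector<llama_token> draft_mod(std::vector<llama_token> ctx, size_t n, size_t draft_max) {
    std::vector<llama_token> draft;
    while (draft.size() < draft_max && ctx.size() >= n) {
        const auto it = g_pool.find(hash_window(ctx, n));
        if (it == g_pool.end()) break;                   // chain broke: stop drafting
        draft.push_back(it->second);
        ctx.push_back(it->second);                       // extend the window and keep chaining
    }
    return draft;
}
```

A real implementation rolls the hash incrementally instead of rehashing the window on every step, which keeps the per-token cost small.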

Realistic recommendation

  1. ngram-map-k4v
    Most people’s current favorite for one-shot / single long generations
    → especially good on code, JSON, markdown, logs, templates with some variation

  2. ngram-mod
    Frequently becomes the strongest option when running llama-server for:
    multi-turn conversations
    agents / tool-calling loops
    repeated coding / writing style
    shared server with similar users
    (needs some generations to warm up the pool)

  3. ngram-map-k
    Solid safe choice when k4v is not available or feels too aggressive

  4. ngram-simple
    Usually only used when the map-based variants are surprisingly slow (rare)
    or when context has extremely precise, long identical repeats

Acceptance rate quick check (printed in logs):
excellent: try longer --spec-ngram-size-m / --draft-max
good: good balanced regime
low: reduce aggression or switch to --spec-type none

firecoperana and others added 2 commits February 9, 2026 18:23
common : use common_ prefix for common library function

llama : use LLAMA_TOKEN_NULL

spec : add self speculative decoding (no draft model required) + refactor

spec : add ngram-mod

spec : various improvements to ngram-map + docs

spec : fix the check-rate logic of ngram-simple

common : add common_speculative_is_compat()

spec : simplify time measurement using common_time_meas

refactor common_sampler_init

refactor common_token_to_piece

refactor and fix cur_p bug

clean up
@MrHills-rs

Can we have this in the server? Otherwise it's hard for me to test. I'm on a phone, away from home; I can only access my PC via SSH.

error: unknown argument: --spec-type
usage: build/bin/llama-server [options]

@firecoperana
Collaborator Author

I don't have this issue.

@MrHills-rs

Yeah sorry I built the wrong thing.
Yeah, it works very well! But, it doesn't work when using a VL model with a loaded mmproj file.

Would it be possible to make it work with VL models too?

@ubergarm
Contributor

ubergarm commented Feb 10, 2026

I was able to build this branch (fcp/spec_self@38f61029) and run Step-Fun-3.5-Flash on 2x RTX A6000s (96 GB VRAM total) like so:

model=ubergarm/Step-3.5-Flash-GGUF/Step-3.5-Flash-smol-IQ3_KS-00001-of-00003.gguf
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-server \
  --model "$model" \
  --alias ubergarm/Step-Fun-3.5-Flash \
  -c 121072 \
  -khad -ctk q6_0 -ctv q8_0 \
  -ger \
  -sm graph \
  -ngl 99 \
  -ub 4096 -b 4096 \
  -ts 99,100 \
  --threads 1 \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  --no-mmap \
  --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 \
  --validate-quants

slot print_timing: id  0 | task -1 |
prompt eval time =    2316.77 ms /  4914 tokens (    0.47 ms per token,  2121.05 tokens per second)
       eval time =    3321.06 ms /   161 tokens (   20.63 ms per token,    48.48 tokens per second)
      total time =    5637.83 ms /  5075 tokens
statistics ngram_mod: #calls(b,g,a) = 91 28639 66, #gen drafts = 79, #acc drafts = 66, #gen tokens = 4971, #acc tokens = 1477, dur(b,g,a) = 228.436, 23.177, 6.046 ms

I'm not sure of the best way to tweak --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 to improve the chances of faster TG for some repetitive workloads.

fwiw it still seems to be working okay with opencode.

@firecoperana
Collaborator Author

firecoperana commented Feb 10, 2026

Yeah sorry I built the wrong thing. Yeah, it works very well! But, it doesn't work when using a VL model with a loaded mmproj file.

Would it be possible to make it work with VL models too?

It's best to create a feature request in mainline. This is beyond my ability.

@sayap
Contributor

sayap commented Feb 11, 2026

Did some testing with GLM-4.7..

Baseline without speculative decoding:

slot print_timing: id  0 | task -1 |
prompt eval time =   45378.11 ms /  5210 tokens (    8.71 ms per token,   114.81 tokens per second)
       eval time =   53183.46 ms /   545 tokens (   97.58 ms per token,    10.25 tokens per second)
      total time =   98561.57 ms /  5755 tokens

With self speculative decoding (--spec-type ngram-mod --draft-p-min 0.5):

begin: ngram_mod occupancy = 2926/4194304 (0.00)
accept: low acceptance streak (3) – resetting ngram_mod
slot print_timing: id  0 | task -1 |
prompt eval time =   45379.94 ms /  5210 tokens (    8.71 ms per token,   114.81 tokens per second)
       eval time =   46855.81 ms /   545 tokens (   85.97 ms per token,    11.63 tokens per second)
      total time =   92235.75 ms /  5755 tokens
draft acceptance rate = 0.49000 (  196 accepted /   400 generated)
statistics ngram_mod: #calls(b,g,a) = 1 348 21, #gen drafts = 25, #acc drafts = 21, #gen tokens = 400, #acc tokens = 196, dur(b,g,a) = 0.246, 0.580, 0.498 ms

With GLM-4.5 DRAFT model from jukofyork (-md GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf -ngld 999 --draft-p-min 0.5):

slot print_timing: id  0 | task -1 |
prompt eval time =   45439.06 ms /  5210 tokens (    8.72 ms per token,   114.66 tokens per second)
       eval time =   33312.28 ms /   545 tokens (   61.12 ms per token,    16.36 tokens per second)
      total time =   78751.34 ms /  5755 tokens
draft acceptance rate = 0.60417 (  493 accepted /   816 generated)
statistics draft: #calls(b,g,a) = 1 51 46, #gen drafts = 51, #acc drafts = 46, #gen tokens = 816, #acc tokens = 493, dur(b,g,a) = 0.000, 2293.628, 0.015 ms

@jukofyork

jukofyork commented Feb 11, 2026

Did some testing with GLM-4.7.. [baseline, ngram-mod, and draft-model timings quoted above]

It probably depends a lot on the task and how much repeated output there is - the new method requires that large chunks of the preceding context get repeated to work well.

Also linking this post I just saw in case anyone is using with DOS/Windows \r\n style newlines:

https://old.reddit.com/r/LocalLLaMA/comments/1r1k5gn/psa_on_llamacpp_spectype_ngrammod_use_lf_not_crlf/


Has anyone tested it with a model using MLA? I couldn't get much of a boost in ik_llama.cpp using my tiny Kimi-K2 model with -mla 3, but for some reason -mla 1 worked better (which, looking through the source, shouldn't be the case, as ik_llama.cpp should be using the same branch as -mla 1 for small batches?).

@sayap
Contributor

sayap commented Feb 11, 2026

It probably depends a lot on the task and how much repeated output there is - the new method requires that large chunks of the preceding context get repeated to work well.

I see. I am actually quite happy with the 13% TG boost already 😄

@MrHills-rs

MrHills-rs commented Feb 11, 2026

I'm very happy with this, seeing a 10-30% TG boost at high context (45k+). I expected it to be good for programming only, but it turns out that with enough context it's perfectly fine for Q&A and conversations. I use this:

--draft-min 1 --spec-ngram-size-n 8 --draft-max 4 --spec-type ngram-map-k --draft-p-min 0.2

For my use case (Qwen3-VL 235B, high RAM offload) it essentially negates the TG loss that comes with long context.

Anyway:

--spec-ngram-size-n
What it does: How far back to look in the text to find patterns (context window for pattern matching)
Higher values: Look for longer patterns, potentially better matches but slower
Lower values: Faster lookup but might miss longer patterns

This makes it sound like it's how far back in the context window the program is allowed to search (e.g. you have 40k context but only want to search the last 20k), but I imagine this is just the minimum sequence length that must match for the speculation to start; i.e. if the last spec-ngram-size-n generated tokens happen to match an already-generated sequence in the past, speculate that the next draft-max tokens will be the same as the continuation of that sequence.
If you set spec-ngram-size-n to something really big, like 20000, the self speculation probably never gets to work.
Correct me if I'm wrong.

@MrHills-rs

Did some testing with GLM-4.7.. [baseline, ngram-mod, and draft-model timings quoted above]

I advise against --spec-type ngram-mod. While it's the one that sounds the best, it's also the one that gave me the worst results. Also, self speculative decoding seems to work better the more conversational context you have, since it has more examples of past sequences to work with. It looks like you only had about 5k of context there, and especially if that's not past messages but rather a system prompt, results are bound to be poor.

@Nexesenex
Contributor

Nexesenex commented Feb 11, 2026

Did some testing with GLM-4.7.. [baseline, ngram-mod, and draft-model timings quoted above]

Did you use -sm layer or -sm graph for the GLM 4.7 main model?

@hunterx2591

Looks like it dumps when the context shrinks; otherwise about a 20 percent increase (I normally get 16 tps).

./build/bin/llama-server
--model "/home/xeon/ik_llama.cpp/models/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf"
--alias "Step-Fun-3.5-flash"
--slot-save-path "/tmp/claw_cache/mem"
--prompt-cache "/tmp/claw_cache/mem/step_35_base.bin"
--prompt-cache-all
-c 196000 -ctk q8_0 -ctv q8_0
-b 4096
-amb 2048
--spec-type ngram-map-k
--spec-ngram-size-n 8
--draft-min 1
--draft-max 48
--draft-p-min 0.2
-mla 1
-fa on
-ub 4096
-ngl 99
-sm graph
-gr
-smgs
-ger
--n-cpu-moe 99
-ts 1,1
--parallel 1
--threads 42
--host 0.0.0.0
--port 8080
--merge-qkv
--jinja
--mirostat 2
--mirostat-ent 1.5
--mirostat-lr 0.076
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="128799779225600" timestamp=1770857855 id_slot=0 id_task=2281 p0=39604
INFO [ release_slots] slot released | tid="128799779225600" timestamp=1770857874 id_slot=0 id_task=2281 n_ctx=196096 n_past=40048 n_system_tokens=0 n_cache_tokens=40048 truncated=false
INFO [ slots_idle] all slots are idle | tid="128799779225600" timestamp=1770857874
slot print_timing: id 0 | task -1 |
prompt eval time = 478.23 ms / 65 tokens ( 7.36 ms per token, 135.92 tokens per second)
eval time = 19350.44 ms / 380 tokens ( 50.92 ms per token, 19.64 tokens per second)
total time = 19828.68 ms / 445 tokens
draft acceptance rate = 0.21528 ( 31 accepted / 144 generated)
statistics ngram_map_k: #calls(b,g,a) = 9 2604 30, #gen drafts = 41, #acc drafts = 30, #gen tokens = 1968, #acc tokens = 178, dur(b,g,a) = 0.481, 3.070, 0.021 ms
INFO [ log_server_request] request | tid="128604348317696" timestamp=1770857874 remote_addr="100.74.165.82" remote_port=61179 status=200 method="POST" path="/v1/chat/completions" params={}
======== Prompt cache: cache size: 40048, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50

- looking for better prompt, base f_keep = 0.998, sim = 0.997, n_keep = 0, n_discarded_prompt = 0
- cache state: 1 prompts, 129.909 MiB (limits: 8192.000 MiB, 0 tokens, 87716 est)
- prompt 0x5c710e751a10: 1391 tokens, 0 discarded, checkpoints: 0, 129.909 MiB
prompt cache load took 5.09 ms
INFO [ launch_slot_with_task] slot is processing task | tid="128799779225600" timestamp=1770857881 id_slot=0 id_task=2631
======== Cache: cache_size = 40048, n_past0 = 39974, n_past1 = 39974, n_past_prompt1 = 39974, n_past2 = 39974, n_past_prompt2 = 39974
Common part does not match fully
cache : <function=exec>
{ "command": "/Users/jenny/.openclaw/workspace/google_env/bin/python3 /Users/jen
prompt: <function=exec<tool_call>
<function=exec>
<parameter=command>
/Users/jenny/.openclaw/workspace/google_env
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="128799779225600" timestamp=1770857881 id_slot=0 id_task=2631 p0=39974
INFO [ release_slots] slot released | tid="128799779225600" timestamp=1770857893 id_slot=0 id_task=2631 n_ctx=196096 n_past=40342 n_system_tokens=0 n_cache_tokens=40342 truncated=false
INFO [ slots_idle] all slots are idle | tid="128799779225600" timestamp=1770857893
slot print_timing: id 0 | task -1 |
prompt eval time = 866.68 ms / 165 tokens ( 5.25 ms per token, 190.38 tokens per second)
eval time = 10641.48 ms / 204 tokens ( 52.16 ms per token, 19.17 tokens per second)
total time = 11508.15 ms / 369 tokens
draft acceptance rate = 0.04167 ( 2 accepted / 48 generated)
statistics ngram_map_k: #calls(b,g,a) = 10 2805 31, #gen drafts = 42, #acc drafts = 31, #gen tokens = 2016, #acc tokens = 180, dur(b,g,a) = 0.482, 3.229, 0.022 ms
INFO [ log_server_request] request | tid="128604356710400" timestamp=1770857893 remote_addr="100.74.165.82" remote_port=61203 status=200 method="POST" path="/v1/chat/completions" params={}
======== Prompt cache: cache size: 40342, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
- looking for better prompt, base f_keep = 0.999, sim = 0.998, n_keep = 0, n_discarded_prompt = 0
- cache state: 1 prompts, 129.909 MiB (limits: 8192.000 MiB, 0 tokens, 87716 est)
- prompt 0x5c710e751a10: 1391 tokens, 0 discarded, checkpoints: 0, 129.909 MiB
prompt cache load took 5.08 ms
INFO [ launch_slot_with_task] slot is processing task | tid="128799779225600" timestamp=1770857898 id_slot=0 id_task=2834
======== Cache: cache_size = 40342, n_past0 = 40291, n_past1 = 40291, n_past_prompt1 = 40291, n_past2 = 40291, n_past_prompt2 = 40291
Common part does not match fully
cache : <function=exec>
{
"command": "ssh -o ConnectTimeout=5 jenny@100.91.114.7 'pg
prompt: <function=exec<tool_call>
<function=exec>
<parameter=command>
ssh -o ConnectTimeout=5 jenny@100.91.
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="128799779225600" timestamp=1770857898 id_slot=0 id_task=2834 p0=40291
slot print_timing: id 0 | task -1 |
prompt eval time = 722.96 ms / 127 tokens ( 5.69 ms per token, 175.67 tokens per second)
eval time = 23985.04 ms / 426 tokens ( 56.30 ms per token, 17.76 tokens per second)
total time = 24708.00 ms / 553 tokens
draft acceptance rate = 0.06944 ( 10 accepted / 144 generated)
statistics ngram_map_k: #calls(b,g,a) = 11 3220 34, #gen drafts = 45, #acc drafts = 34, #gen tokens = 2160, #acc tokens = 190, dur(b,g,a) = 0.483, 3.601, 0.024 ms
INFO [ release_slots] slot released | tid="128799779225600" timestamp=1770857923 id_slot=0 id_task=2834 n_ctx=196096 n_past=40843 n_system_tokens=0 n_cache_tokens=40843 truncated=false
INFO [ slots_idle] all slots are idle | tid="128799779225600" timestamp=1770857923
INFO [ log_server_request] request | tid="128604339924992" timestamp=1770857923 remote_addr="100.74.165.82" remote_port=61207 status=200 method="POST" path="/v1/chat/completions" params={}
======== Prompt cache: cache size: 40843, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
- looking for better prompt, base f_keep = 1.000, sim = 0.999, n_keep = 0, n_discarded_prompt = 0
- cache state: 1 prompts, 129.909 MiB (limits: 8192.000 MiB, 0 tokens, 87716 est)
- prompt 0x5c710e751a10: 1391 tokens, 0 discarded, checkpoints: 0, 129.909 MiB
prompt cache load took 4.55 ms
INFO [ launch_slot_with_task] slot is processing task | tid="128799779225600" timestamp=1770858031 id_slot=0 id_task=3251
======== Cache: cache_size = 40843, n_past0 = 38153, n_past1 = 38153, n_past_prompt1 = 38153, n_past2 = 38153, n_past_prompt2 = 38153
Common part does not match fully
cache : <|im_start|>assistant
The user is asking if the music generation is done and if it saved to the desktop. I need to check the output
prompt: <|im_start|>assistant ls -lt ~/Desktop | head -n 20

INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="128799779225600" timestamp=1770858031 id_slot=0 id_task=3251 p0=38153
common_ngram_map_begin: refresh map: idx_last_draft=40838, new begin=39857, #keys_checked=42, #keys_del=1, #values_del=0, #hashes_upd=313
/home/xeon/ik_llama.cpp/common/ngram-map.cpp:236: common_ngram_map_draft: map.idx_last_check > cur_len: 40417 > 39857
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
xeon@xeon-System-Product-Name:~/ik_llama.cpp$

@MrHills-rs

MrHills-rs commented Feb 12, 2026

There's a bug:

INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140416161607680" timestamp=1770882713 id_slot=0 id_task=1031 p0=2077
slot print_timing: id 0 | task -1 |
prompt eval time = 4605.32 ms / 1824 tokens ( 2.52 ms per token, 396.06 tokens per second)
eval time = 61879.23 ms / 424 tokens ( 145.94 ms per token, 6.85 tokens per second)
total time = 66484.55 ms / 2248 tokens
statistics ngram_map_k: #calls(b,g,a) = 4 1448 0, #gen drafts = 0, #acc drafts = 0, #gen tokens = 0, #acc tokens = 0, dur(b,g,a) = 0.003, 1.512, 0.000 ms
INFO [ release_slots] slot released | tid="140416161607680" timestamp=1770882779 id_slot=0 id_task=1031 n_ctx=49152 n_past=4324 n_system_tokens=0 n_cache_tokens=4324 truncated=false
INFO [ slots_idle] all slots are idle | tid="140416161607680" timestamp=1770882779
INFO [ log_server_request] request | tid="140405962170368" timestamp=1770882779 remote_addr="127.0.0.1" remote_port=56526 status=200 method="POST" path="/v1/chat/completions" params={}
[INFO] Request 127.0.0.1 "POST /v1/chat/completions HTTP/1.1" 200 132559 "node-fetch" 1m6.494922219s
======== Prompt cache: cache size: 4324, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
updating prompt cache - saving prompt with length 4324, total state size = 421.773 MiB
prompt cache save took 85.99 ms
- looking for better prompt, base f_keep = 0.001, sim = 0.001, n_keep = 0, n_discarded_prompt = 0
- found better prompt with f_keep = 0.001, sim = 0.001, n_keep = 0, n_discarded_prompt = 0
- cache state: 1 prompts, 421.773 MiB (limits: 8192.000 MiB, 0 tokens, 83984 est)
- prompt 0x564b5fcacdd0: 4324 tokens, 0 discarded, checkpoints: 0, 421.773 MiB
prompt cache load took 21.04 ms
INFO [ launch_slot_with_task] slot is processing task | tid="140416161607680" timestamp=1770882906 id_slot=0 id_task=1456
======== Cache: cache_size = 3030, n_past0 = 3, n_past1 = 3, n_past_prompt1 = 3, n_past2 = 3, n_past_prompt2 = 3
Common part does not match fully
INFO [ batch_pending_prompt] kv cache rm [p0, end) | tid="140416161607680" timestamp=1770882906 id_slot=0 id_task=1456 p0=3
/home/marco/AI/ik/common/ngram-map.cpp:236: common_ngram_map_draft: map.idx_last_check > cur_len: 3900 > 2751
ptrace: Operation not permitted.
No stack.
The program is not being run.
2026/02/12 08:56:20 httputil: ReverseProxy read error during body copy: unexpected EOF
[INFO] <qwen-vl-largectx> recovered from client disconnection during streaming
[INFO] Request 127.0.0.1 "POST /v1/chat/completions HTTP/1.1" 200 632 "node-fetch" 1m14.317839132s
[WARN] <qwen-vl-largectx> ExitError >> signal: aborted (core dumped), exit code: -1
[INFO] <qwen-vl-largectx> process exited but not StateStopping, current state: ready

Happens every time there is a character swap, maybe whenever context is discarded?

./build/bin/llama-server
      -m /home/marco/AI/ik/models/Qwen3-VL-235B-A22B-Thinking.i1-IQ3_M.gguf
      --context-shift on
      -ot "blk\.(?:[0-9]|[1-7][0-9]|[8][0-5])\.ffn.*_exps.*=CPU"
      -c 49152
      -b 8192 -ub 4096
      -ctk q8_0 -ctv q8_0
      --threads 8 -ngl 95
      -cuda fusion=1,offload-batch-size=8,mmq-id-size=512
      -amb 512
      --host 127.0.0.1
      --port ${PORT}
      --webui none --jinja
      --repeat-last-n 2048 -mqkv -muge
      --reasoning-format deepseek-legacy
      --draft-min 1 --spec-ngram-size-n 8
      --draft-max 4 --spec-type ngram-map-k --draft-p-min 0.2

Also, this will happen with --spec-type ngram-map-k, but not with --spec-type ngram-simple.
(Using llama-swap, although I don't think it's relevant.)

@sayap
Contributor

sayap commented Feb 12, 2026

Did you use -sm layer or -sm graph for the GLM 4.7 main model?

Neither. This is a Strix Halo + 1 RTX 3090 eGPU setup, and I only use the CPU of the Strix Halo. It can fit a 3.2 bpw quant of GLM 4.7.

@firecoperana
Collaborator Author

firecoperana commented Feb 12, 2026

Something is wrong with mainline's ngram-map-k implementation, as I had the same crash testing mainline; interestingly, ngram-map-k4v works fine.
While waiting for mainline's fix, I changed the abort to a warning.

ikawrakow merged commit 1cb7e1b into main on Feb 13, 2026
@abc-nix
Contributor

abc-nix commented Feb 13, 2026

Is it normal, for the same seed, to have each generation be different? This breaks reproducibility for me. Am I doing something wrong?

Model: GLM 4.7 IQ4_k from ubergarm.

Relevant params:

      --temp 1.0
      -s 1234
      --spec-type ngram-map-k4v
      --draft-min 1 --draft-max 4 --spec-ngram-size-n 4 --draft-p-min 0.2

If I disable (not use) spec-decoding I get the same results (for the same seed) each generation.
