UPSTREAM PR #18471: Add self-speculative decoding (no draft model required) #750

Open
loci-dev wants to merge 26 commits into main from upstream-PR18471-branch_srogmann-feature/self-speculative

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18471

This PR introduces self-speculative decoding: instead of using a dedicated draft model (which works well when one is available, see #18039), the current token history is used to predict future tokens. This can provide a speedup when the output repeats parts of the prompt; a typical example is making many small changes in a large source file.

Example 1 (gpt-oss-120b in VRAM): Translation of a few comments in a Python script (chosen as a favorable case).

slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 324, pos_max = 2883, size = 90.030 MiB)
slot print_timing: id  3 | task 0 | 
prompt eval time =     436.48 ms /  2948 tokens (    0.15 ms per token,  6754.03 tokens per second)
       eval time =   18886.86 ms /  3423 tokens (    5.52 ms per token,   181.24 tokens per second)
      total time =   19323.34 ms /  6371 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 6370, truncated = 0

Same prompt with --draft-min 12 --draft-max 48 --spec-self 1:

slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 324, pos_max = 2883, size = 90.030 MiB)
slot print_timing: id  3 | task 0 | 
prompt eval time =     431.85 ms /  2948 tokens (    0.15 ms per token,  6826.38 tokens per second)
       eval time =    7163.27 ms /  3193 tokens (    2.24 ms per token,   445.75 tokens per second)
      total time =    7595.13 ms /  6141 tokens
draft acceptance rate = 0.76827 ( 2397 accepted /  3120 generated)
slot      release: id  3 | task 0 | stop processing: n_tokens = 6140, truncated = 0

To keep the PR simple, the new argument --spec-self reuses the same --draft-min and --draft-max values that would be used for a dedicated draft model. When both speculative decoding methods are combined, these values are shared, so the min/max cannot be tuned independently for each method.
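For reference, a server invocation along the lines of Example 1 might look like the following; the model file name and GPU-offload setting are placeholders, and only --spec-self is new in this PR (the other flags are existing llama-server options):

llama-server -m gpt-oss-120b.gguf -ngl 99 --draft-min 12 --draft-max 48 --spec-self 1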

Example 2 (Qwen3-235B, with heavy offloading):

slot update_slots: id  3 | task 0 | prompt done, n_tokens = 2962, batch.n_tokens = 914
slot print_timing: id  3 | task 0 |
prompt eval time =   15606.37 ms /  2962 tokens (    5.27 ms per token,   189.79 tokens per second)
       eval time =  252551.71 ms /  2973 tokens (   84.95 ms per token,    11.77 tokens per second)
      total time =  268158.08 ms /  5935 tokens
srv  log_server_r: request: POST /v1/chat/completions 192.168.32.208 200

Same prompt with --draft-min 15 --draft-max 40 --spec-self 1:

slot update_slots: id  3 | task 0 | prompt done, n_tokens = 2962, batch.n_tokens = 914
slot print_timing: id  3 | task 0 | 
prompt eval time =   15474.80 ms /  2962 tokens (    5.22 ms per token,   191.41 tokens per second)
       eval time =  141116.29 ms /  2963 tokens (   47.63 ms per token,    21.00 tokens per second)
      total time =  156591.09 ms /  5925 tokens
draft acceptance rate = 0.86304 ( 2382 accepted /  2760 generated)

This speedup (from ~12 to ~21 tokens/s) occurs only in favorable cases with large repeated sections!

The algorithm is simple: search the token history for an earlier occurrence of the last draft-min tokens and use the draft-max tokens that follow that occurrence as the speculative draft. No further optimizations are implemented. I had the idea for this PR while waiting for a source file to finish generating at 5 t/s ;-)
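As a rough illustration of the idea (a minimal sketch, not the PR's actual implementation; the function name, the llama_token alias, and the exact matching strategy are assumptions):

#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = std::int32_t; // token id type, as in llama.cpp

// Build a draft by searching the history for an earlier occurrence of the
// last n_min tokens and copying the tokens that followed it (at most n_max).
// Returns an empty vector if no match is found.
static std::vector<llama_token> draft_from_history(
        const std::vector<llama_token> & history, size_t n_min, size_t n_max) {
    std::vector<llama_token> draft;
    if (n_min == 0 || history.size() < 2 * n_min) {
        return draft; // not enough history to match against
    }
    const size_t pat_start = history.size() - n_min; // start of the most recent n_min tokens
    // walk candidate start positions backwards so the most recent occurrence wins
    for (size_t i = pat_start - n_min + 1; i-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < n_min; ++j) {
            if (history[i + j] != history[pat_start + j]) { match = false; break; }
        }
        if (!match) {
            continue;
        }
        // copy up to n_max tokens that followed the earlier occurrence
        for (size_t k = i + n_min; k < history.size() && draft.size() < n_max; ++k) {
            draft.push_back(history[k]);
        }
        break;
    }
    return draft;
}

As with regular speculative decoding, the drafted tokens are then verified against the target model in a single batched pass and accepted only while they agree with it, which is where the speedup in the examples above comes from.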

@loci-dev force-pushed the main branch 24 times, most recently from ca06125 to 76fc6ba on January 2, 2026 at 00:37
@loci-dev force-pushed the upstream-PR18471-branch_srogmann-feature/self-speculative branch from 7b3d537 to 9fee55e on January 2, 2026 at 00:48
@loci-dev force-pushed the main branch 3 times, most recently from 86bf5db to 07aff19 on January 2, 2026 at 17:07
@loci-dev force-pushed the main branch 28 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32
@loci-dev force-pushed the main branch 2 times, most recently from 0cb533b to ef7afbe on February 13, 2026 at 02:17