UPSTREAM PR #18471: Add self-speculative decoding (no draft model required) #750
Mirrored from ggml-org/llama.cpp#18471
This PR introduces self-speculative decoding: instead of using a dedicated draft model (which works well when one is available, see #18039), the current token history is used to predict future tokens. This can provide a speedup in cases where the output contains repeated parts of the prompt. A typical example is making many small changes in a large source file.
Example 1 (`gpt-oss-120b` in VRAM): translation of a few comments in a Python script (chosen as a favorable case).

Same prompt with `--draft-min 12 --draft-max 48 --spec-self 1`:

To keep the PR simple, the new argument `--spec-self` reuses the same `draft-min` and `draft-max` values as used for a potential draft model. When combining both speculative decoding methods, these values are shared (no independent tuning of min/max for each method).

Example 2 (`Qwen3-235B`, with heavy offloading):

Same prompt with `--draft-min 15 --draft-max 40 --spec-self 1`:

This speedup factor (from ~12 to ~21 tokens/s) occurs only in favorable cases with large repeated sections!
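For illustration, enabling the new mode on an ordinary run might look like `llama-cli -m model.gguf --draft-min 12 --draft-max 48 --spec-self 1` (the binary name and model path here are placeholders; `--draft-min`/`--draft-max` are the existing speculative-decoding options, and `--spec-self` is the switch added by this PR).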
The algorithm is simple: search for a pattern of length `draft-min` in the token history and use the subsequent `draft-max` tokens for speculation. No further optimizations are implemented. I had the idea for this PR while waiting for a source file to finish at 5 t/s ;-)
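For readers who want the gist without reading the diff, here is a minimal C++ sketch of that lookup. It is not the PR's actual code; the function name, the plain `std::vector<int32_t>` history, and the naive backwards scan are simplifications chosen for clarity:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch only: propose a draft by matching the last n_min tokens of the
// history against an earlier position and copying the tokens that followed it.
static std::vector<int32_t> self_draft(const std::vector<int32_t> & history,
                                       size_t n_min, size_t n_max) {
    std::vector<int32_t> draft;
    if (n_min == 0 || history.size() <= n_min) {
        return draft; // not enough context to form a pattern
    }

    const size_t pat_start = history.size() - n_min; // pattern = last n_min tokens

    // scan backwards so the most recent earlier occurrence wins
    for (size_t i = pat_start; i-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < n_min; ++j) {
            if (history[i + j] != history[pat_start + j]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }

        // pattern found at position i: draft up to n_max tokens that followed it
        for (size_t k = i + n_min; k < history.size() && draft.size() < n_max; ++k) {
            draft.push_back(history[k]);
        }
        break;
    }

    return draft; // empty if no repeated pattern was found
}

int main() {
    // toy history in which "3 4" recurs, so the last two tokens match an
    // earlier occurrence and the tokens that followed it ("5 6 7") are drafted
    const std::vector<int32_t> history = {1, 2, 3, 4, 5, 6, 7, 8, 3, 4};
    for (int32_t t : self_draft(history, /*n_min=*/2, /*n_max=*/3)) {
        printf("%d ", (int) t); // prints: 5 6 7
    }
    printf("\n");
    return 0;
}
```

The drafted tokens are then verified by the target model exactly as in ordinary speculative decoding; if no earlier occurrence of the last `draft-min` tokens exists, no draft is proposed for that step and generation proceeds one token at a time.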