UPSTREAM PR #18471: Add self-speculative decoding (no draft model required) #750
Mirrored from ggml-org/llama.cpp#18471
This PR introduces self-speculative decoding: instead of using a dedicated draft model (which works well when one is available, see #18039), the current token history is used to predict future tokens. This can provide a speedup in cases where the output contains repeated parts of the prompt. A typical example is making many small changes in a large source file.
Example 1 (`gpt-oss-120b` in VRAM): translation of a few comments in a Python script (chosen as a favorable case).

Same prompt with `--draft-min 12 --draft-max 48 --spec-self 1`:

To keep the PR simple, the new argument `--spec-self` reuses the same `draft-min` and `draft-max` values as used for a potential draft model. When combining both speculative decoding methods, these values are shared (no independent tuning of min/max for each method).

Example 2 (`Qwen3-235B`, with heavy offloading):

Same prompt with `--draft-min 15 --draft-max 40 --spec-self 1`:

This speedup factor (from ~12 to ~21 tokens/s) occurs only in favorable cases with large repeated sections!
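For illustration, enabling the new mode on an ordinary run might look like `llama-cli -m model.gguf --draft-min 12 --draft-max 48 --spec-self 1` (the binary name and model path here are placeholders; `--draft-min`/`--draft-max` are the existing speculative-decoding options, and `--spec-self` is the switch added by this PR).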
The algorithm is simple: search for a pattern of length `draft-min` in the token history and use the subsequent `draft-max` tokens for speculation. No further optimizations are implemented. I had the idea for this PR while waiting for a source file to finish at 5 t/s ;-)
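For readers who want the gist without reading the diff, here is a minimal C++ sketch of that lookup. It is not the PR's actual code; the function name, the plain `std::vector<int32_t>` history, and the naive backwards scan are simplifications chosen for clarity:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch only: propose a draft by matching the last n_min tokens of the
// history against an earlier position and copying the tokens that followed it.
static std::vector<int32_t> self_draft(const std::vector<int32_t> & history,
                                       size_t n_min, size_t n_max) {
    std::vector<int32_t> draft;
    if (n_min == 0 || history.size() <= n_min) {
        return draft; // not enough context to form a pattern
    }

    const size_t pat_start = history.size() - n_min; // pattern = last n_min tokens

    // scan backwards so the most recent earlier occurrence wins
    for (size_t i = pat_start; i-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < n_min; ++j) {
            if (history[i + j] != history[pat_start + j]) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }

        // pattern found at position i: draft up to n_max tokens that followed it
        for (size_t k = i + n_min; k < history.size() && draft.size() < n_max; ++k) {
            draft.push_back(history[k]);
        }
        break;
    }

    return draft; // empty if no repeated pattern was found
}

int main() {
    // toy history in which "3 4" recurs, so the last two tokens match an
    // earlier occurrence and the tokens that followed it ("5 6 7") are drafted
    const std::vector<int32_t> history = {1, 2, 3, 4, 5, 6, 7, 8, 3, 4};
    for (int32_t t : self_draft(history, /*n_min=*/2, /*n_max=*/3)) {
        printf("%d ", (int) t); // prints: 5 6 7
    }
    printf("\n");
    return 0;
}
```

The drafted tokens are then verified by the target model exactly as in ordinary speculative decoding; if no earlier occurrence of the last `draft-min` tokens exists, no draft is proposed for that step and generation proceeds one token at a time.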