Add speculative decoding #1120
Conversation
Tried this on a more realistic example and got worse performance; I think I'll need to tune / implement a heuristic for draft models similar to https://huggingface.co/blog/assisted-generation
Added the adaptive heuristic and it does do better, but it's still occasionally slower even with temperature=0; will need to investigate.
Highly appreciated PR. Is it possible to make …
@oobabooga I saw; I was looking at the HF implementation as a reference. I could add it as a general …
@oobabooga going to merge this now. For updating the draft model or its properties without re-creating the entire …
Awesome, thanks @abetlen!
Uses prompt lookup decoding but the draft model class can be extended to support almost any existing method.
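The idea behind prompt lookup decoding can be sketched in a few lines of plain Python (this is an illustrative sketch, not the PR's actual code): search the prompt/generated sequence for the most recent earlier occurrence of its trailing n-gram, and propose the tokens that followed that occurrence as the draft.

```python
# Illustrative sketch of prompt lookup decoding; function and parameter
# names here are hypothetical, not taken from the PR.
def prompt_lookup_draft(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Return up to num_pred_tokens draft tokens, or [] if no n-gram matches."""
    # Try the longest trailing n-gram first, backing off to shorter ones.
    for ngram_size in range(max_ngram_size, 0, -1):
        if len(input_ids) < ngram_size + 1:
            continue
        ngram = input_ids[-ngram_size:]
        # Scan earlier positions for the same n-gram, most recent match first.
        for start in range(len(input_ids) - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == ngram:
                # Draft the tokens that followed the matched n-gram.
                follow = input_ids[start + ngram_size:
                                   start + ngram_size + num_pred_tokens]
                if follow:
                    return follow
    return []
```

The target model then verifies the drafted tokens in a single forward pass and keeps the longest accepted prefix, which is why output quality is unchanged while decoding can be faster on repetitive inputs.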
Server Usage
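A launch command along these lines should enable it on the server; the exact flag names (`--draft_model`, `--draft_model_num_pred_tokens`) and the `prompt-lookup-decoding` value are assumptions based on this PR's discussion, and the model path is a placeholder.

```shell
# Hypothetical server launch; flag names are assumptions, not verified
# against the merged settings.
python3 -m llama_cpp.server \
  --model models/model.gguf \
  --draft_model prompt-lookup-decoding \
  --draft_model_num_pred_tokens 10
```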
Python Usage
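In Python, usage should look roughly like the sketch below; the `draft_model` parameter, the `LlamaPromptLookupDecoding` class, and its module path are assumptions based on this PR, not verified against the merged code.

```python
# Hypothetical usage sketch of the API this PR adds.
def build_speculative_llm(model_path: str):
    # Imports deferred so the sketch reads without llama-cpp-python installed.
    from llama_cpp import Llama
    from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

    return Llama(
        model_path=model_path,
        # Draft model that proposes tokens via prompt lookup decoding.
        draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
    )
```

After that, `build_speculative_llm("models/model.gguf")` would be called like any other `Llama` instance; the speculative drafting happens transparently during generation.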
Performance
This is a very dumb / easy example but it looks like it's working!
With prompt lookup decoding
Without prompt lookup decoding
Closes #675