Add speculative decoding #1120

Merged: abetlen merged 25 commits into main from add-speculative-decoding on Jan 31, 2024

Conversation

abetlen (Owner) commented Jan 23, 2024

Uses prompt lookup decoding but the draft model class can be extended to support almost any existing method.
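As a rough illustration of that extension point, here is a minimal sketch of a custom draft model. It assumes the draft-model interface is a callable that takes the current token ids and returns an array of candidate token ids (the base class in llama_cpp.llama_speculative is assumed here to be named LlamaDraftModel; the exact name and signature may differ), and the toy strategy below is purely illustrative.

import numpy as np

class RepeatLastTokenDraftModel:
    """Toy draft model: proposes the last seen token, repeated num_pred_tokens times.

    Illustrative only; a real implementation would subclass the library's
    draft-model base class and generate genuinely likely candidate tokens.
    """

    def __init__(self, num_pred_tokens: int = 2):
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids: np.ndarray, /, **kwargs) -> np.ndarray:
        # With no context yet there is nothing sensible to propose.
        if input_ids.size == 0:
            return np.array([], dtype=np.intc)
        # Propose the last token num_pred_tokens times as the draft sequence.
        return np.full(self.num_pred_tokens, input_ids[-1], dtype=np.intc)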

Server Usage

python3 -m llama_cpp.server --model models/7B/llama-model.gguf --draft_model=prompt-lookup-decoding --draft_model_num_pred_tokens=2
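The server exposes an OpenAI-compatible HTTP API; as a hedged example (assuming the default host and port of localhost:8000, adjust the URL if you changed either), a completion request from Python might look like this:

import requests

# Assumes the server started above is listening on its default localhost:8000
# and serves the OpenAI-compatible completions endpoint.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "def fibonacci(n):", "max_tokens": 64},
    timeout=120,
)
print(response.json()["choices"][0]["text"])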

Python Usage

>>> from llama_cpp import Llama
>>> from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
>>> llm = Llama(
...     model_path="./models/7B/llama-model.gguf",
...     draft_model=LlamaPromptLookupDecoding(
...         num_pred_tokens=10,  # good default with GPU offloading; 2 works better on CPU-only machines
...         max_ngram_size=2,  # matches the Hugging Face implementation and worked best in my testing as well
...     ),
... )
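
Once constructed, the model is used like any other Llama instance; a brief usage note (the prompt is only an illustration):

>>> output = llm("Q: Name the planets in the solar system. A: ", max_tokens=64)
>>> print(output["choices"][0]["text"])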

Performance

This is a very simple, contrived example, but it looks like it's working!

With prompt lookup decoding

[screenshot]

Without prompt lookup decoding

[screenshot]

Closes #675

abetlen mentioned this pull request Jan 23, 2024
abetlen changed the title from "Add speculative decoding support" to "Add speculative decoding" Jan 23, 2024
abetlen (Owner, Author) commented Jan 24, 2024

Tried it on a more realistic example and got worse performance; I think I'll need to tune / implement a heuristic for draft models similar to https://huggingface.co/blog/assisted-generation:

"Adjust the number of candidate tokens to be produced in the next iteration — our original heuristic increases it by 2 if ALL tokens match and decreases it by 1 otherwise."
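
For concreteness, a minimal sketch of that heuristic (not the code merged in this PR; the clamping bounds are my own assumption):

def update_num_pred_tokens(num_pred_tokens: int, num_accepted: int, num_drafted: int,
                           min_tokens: int = 1, max_tokens: int = 20) -> int:
    """Adjust the draft length for the next speculative decoding iteration."""
    if num_drafted > 0 and num_accepted == num_drafted:
        # Every drafted token was accepted: draft more aggressively next time.
        num_pred_tokens += 2
    else:
        # At least one draft token was rejected: back off a little.
        num_pred_tokens -= 1
    # Clamp to a sane range (bounds are illustrative, not from the PR).
    return max(min_tokens, min(max_tokens, num_pred_tokens))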

abetlen (Owner, Author) commented Jan 24, 2024

Added the adaptive heuristic and it does do better, but it's still occasionally slower even with temperature=0; I'll need to investigate.

oobabooga (Contributor) commented:

Highly appreciated PR. Is it possible to make prompt_lookup_num_tokens a generation parameter on the same footing as temperature, as is done in the transformers library? That would make it possible to change that parameter without having to reload the model.

abetlen (Owner, Author) commented Jan 25, 2024

@oobabooga I saw that; I was using the HF implementation as a reference. I could add it as a more general num_pred_tokens parameter, since I want to keep it open to other implementations of speculative decoding. I'll think on that one.

abetlen (Owner, Author) commented Jan 31, 2024

@oobabooga going to merge this now. To update the draft model or its properties without re-creating the entire Llama model class, for now just assume that you can access llm.draft_model directly; set it to None to disable speculative decoding.
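
A brief illustration of that, based on the comment above (the specific settings are arbitrary):

from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

# Swap in a draft model with different settings without reloading the weights.
llm.draft_model = LlamaPromptLookupDecoding(num_pred_tokens=2)

# Or disable speculative decoding entirely.
llm.draft_model = None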

abetlen merged commit fb762a6 into main on Jan 31, 2024
16 checks passed
oobabooga (Contributor) commented:

Awesome, thanks @abetlen!

abetlen deleted the add-speculative-decoding branch on January 31, 2024 at 20:27