
Conversation

@eitanturok

@eitanturok eitanturok commented Sep 5, 2025

Implement FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling to speedup speculative decoding. @keyboardAnt @Achazwl @jmamou.
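For context, a minimal sketch of the core FR-Spec idea (illustration only, not this PR's implementation; the tensor sizes, frequency data, and 25% keep fraction below are placeholders): the drafter keeps only the most frequent vocabulary tokens, so its LM-head projection and sampling run over a much smaller vocabulary, and drafted token ids are mapped back to full-vocab ids before the target model verifies them.

```python
import torch

# Placeholder sizes and frequencies; the real setup uses the target model's
# lm_head and a corpus-derived token-frequency file.
vocab_size, hidden = 128_256, 64
lm_head = torch.randn(vocab_size, hidden)
token_freq = torch.rand(vocab_size)

# Keep the top 25% most frequent tokens (fraction chosen arbitrarily here).
keep = int(0.25 * vocab_size)
pruned_vocab = torch.argsort(token_freq, descending=True)[:keep]  # kept full-vocab ids
pruned_lm_head = lm_head[pruned_vocab]                            # (keep, hidden): cheaper matmul

# Draft step: score only the reduced vocabulary, then map back to full-vocab ids.
hidden_state = torch.randn(hidden)
draft_logits = pruned_lm_head @ hidden_state
draft_token_id = pruned_vocab[draft_logits.argmax()]
print(int(draft_token_id))
```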

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@eitanturok eitanturok changed the title Implement fr-spec to speedup speculative decoding Implement fr-spec to speedup speculative decoding Sep 5, 2025
@mergify

mergify bot commented Sep 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @eitanturok.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 7, 2025
@mergify mergify bot removed the needs-rebase label Sep 7, 2025
@mergify mergify bot added the performance Performance-related issues label Sep 8, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Sep 8, 2025
@mergify

mergify bot commented Sep 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @eitanturok.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 8, 2025
@eitanturok
Author

I fixed a couple of issues with the previous benchmark. Turns out, we were compute bound, not memory bound. I re-ran the benchmark and got:

  • num-spec-tokens=1, max-seq-len=1
    • eagle-2 is 61% faster than vanilla
    • fr-spec is 68% faster than vanilla
  • num-spec-tokens=39, max-seq-len=1
    • eagle-2 is ??% faster than vanilla
    • fr-spec is ??% faster than vanilla

Like before, I benchmarked vanilla, eagle-2, and fr-spec on mt-bench with llama-3.1-8b-instruct on 100 prompts.

Speculative Decoding Benchmark Results

| Method  | Depth | Branching | Num Spec Tokens | Mean Acceptance Length | Decoding Throughput (tokens/s) | Total Time (s) | Forward Ratio |
|---------|-------|-----------|-----------------|------------------------|--------------------------------|----------------|---------------|
| Eagle   | 3     | 1         | 3               | 2.31                   | 119.31                         | 181.29         | 0.137         |
| fr-spec | 3     | 1         | 3               | 2.23                   | 121.46                         | 178.13         | 0.138         |
| Eagle   | 1     | 3         | 3               | 2.31                   | 120.16                         | 180.02         | 0.138         |
| fr-spec | 1     | 3         | 3               | 2.23                   | 121.99                         | 177.36         | 0.139         |
| fr-spec | 1     | 1         | 1               | 1.68                   | 104.59                         | 206.23         | 0.044         |
| Eagle   | 1     | 1         | 1               | 1.71                   | 100.01                         | 215.60         | 0.044         |
| Vanilla | N/A   | N/A       | 0               | 1.00                   | 61.74                          | 347.07         | 0.000         |
Commands to reproduce the table
| Method  | Depth | Branching | Num Spec Tokens | Command |
|---------|-------|-----------|-----------------|---------|
| Eagle   | 3     | 1         | 3               | `VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 100 --max-num-seqs 1 --compilation-config '{"level": "0"}' --num-spec-tokens 3` |
| fr-spec | 3     | 1         | 3               | `VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 100 --max-num-seqs 1 --compilation-config '{"level": "0"}' --num-spec-tokens 3 --draft-vocab-frequency-path 'eturok/llama-3.1-8b-instruct-vocab-freq/vocab_freq.pt' --draft-vocab-frequency-keep-threshold 0.25` |
| Eagle   | 1     | 3         | 3               | `VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 100 --max-num-seqs 1 --compilation-config '{"level": "0"}' --num-spec-tokens 3 --spec-token-tree-depth 1 --spec-token-tree-branching 3` |
| fr-spec | 1     | 3         | 3               | `VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 100 --max-num-seqs 1 --compilation-config '{"level": "0"}' --num-spec-tokens 3 --spec-token-tree-depth 1 --spec-token-tree-branching 3 --draft-vocab-frequency-path 'eturok/llama-3.1-8b-instruct-vocab-freq/vocab_freq.pt' --draft-vocab-frequency-keep-threshold 0.25` |
| fr-spec | 1     | 1         | 1               | `VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 100 --max-num-seqs 1 --compilation-config '{"level": "0"}' --num-spec-tokens 1 --draft-vocab-frequency-path 'eturok/llama-3.1-8b-instruct-vocab-freq/vocab_freq.pt' --draft-vocab-frequency-keep-threshold 0.25` |
| Eagle   | 1     | 1         | 1               | `VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 100 --max-num-seqs 1 --compilation-config '{"level": "0"}' --num-spec-tokens 1` |
| Vanilla | N/A   | N/A       | 0               | `VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 100 --max-num-seqs 1 --compilation-config '{"level": "0"}' --num-spec-tokens 0` |

Observations:

  1. We definitely get a speedup over vanilla.
  2. In eagle with num-spec-tokens=1, the drafter forward pass takes 4% of the time of the target forward pass, so the drafter is not a major bottleneck and we don't expect fr-spec to speed things up much in this regime (see the rough estimate after this list). Another perspective: if the drafter forward pass is cheap, then maybe we can "sneak in" more speculative tokens to get a higher acceptance length without incurring much additional compute.
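As a rough sanity check on observation 2 (my own back-of-the-envelope, assuming total decode time is roughly target forward + drafter forward, and taking Forward Ratio as drafter time / target time):

```python
# Rough upper bound on what shrinking the drafter can buy at num-spec-tokens=1,
# using the Forward Ratio from the table above (drafter forward time / target
# forward time) and assuming total decode time ~= target forward + drafter forward.
forward_ratio = 0.044
max_speedup = 1 + forward_ratio  # if the drafter forward pass were completely free
print(f"upper bound on end-to-end speedup: {max_speedup:.3f}x")  # ~1.044x, i.e. <= ~4.4%
```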

I made several fixes:

  1. We previously ran the benchmark with max-seq-length=100. I had accidentally been calling max-seq-length the batch size, but they are different: max-seq-length is the number of user prompts we process at once. In that setting we are compute bound and don't really see a speedup, because speculative decoding is faster only when we are memory bound. So this time we run with max-seq-length=1, as @Achazwl suggested. Also, in the fr-spec repo, it looks like they ran with a batch size of one (here and here).
  2. It seems like CUDA graphs are broken for speculative decoding drafters, which gives vanilla an unfair advantage in comparisons, so I turned off model compilation entirely for a fair comparison.
  3. When I set num-speculative-tokens to 32, eagle-2 became incredibly slow.

@mergify mergify bot added the frontend label Sep 19, 2025
old_weight = self.model.lm_head.weight

# In-place pruning of the weight
self.model.lm_head.weight.data = self.model.lm_head.weight.data[self.pruned_vocab].clone().detach()

@jmamou jmamou Sep 29, 2025

@eitanturok

Since you are selecting part of the indices, you should ensure that self.model.lm_head.weight.data[self.pruned_vocab].clone().detach() is contiguous.
I guess self.model.lm_head.weight.data[self.pruned_vocab].clone().detach().contiguous() should work.
Look at
https://docs.pytorch.org/docs/stable/generated/torch.Tensor.contiguous.html#torch-tensor-contiguous

@eitanturok
Author

eitanturok commented Sep 30, 2025

@jmamou self.model.lm_head.weight.data is already contiguous so we don't need to add .contiguous().

I added a print to the code

self.model.lm_head.weight.data = self.model.lm_head.weight.data[self.pruned_vocab].clone().detach()
print(f"self.model.lm_head.weight.data.is_contiguous(): {self.model.lm_head.weight.data.is_contiguous()}")

ran the cmd

VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py \
  --dataset-name hf \
  --dataset-path philschmid/mt-bench \
  --num-prompts 100 \
  --compilation-config '{"level": "0"}' \
  --max-num-seqs 1 \
  --num-spec-tokens 1 \
  --draft-vocab-frequency-path 'thunlp/LLaMA3-Instruct-8B-FR-Spec/freq_32768.pt' \
  --draft-vocab-frequency-keep-threshold 0.25

and got the output

self.model.lm_head.weight.data.is_contiguous(): True

For more details, see this pytorch PR.
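A standalone way to see this (toy tensor sizes, not the vLLM code path): advanced indexing with an index tensor copies the selected rows into a fresh tensor, which is already contiguous.

```python
import torch

# Toy-sized illustration: selecting rows with an index tensor (advanced indexing)
# copies them into a new tensor, so the result is contiguous without an extra
# .contiguous() call.
weight = torch.randn(1000, 64)
pruned_vocab = torch.randperm(1000)[:250]

pruned = weight[pruned_vocab].clone().detach()
print(pruned.is_contiguous())  # True
```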

@mergify

mergify bot commented Oct 8, 2025

Documentation preview: https://vllm--24343.org.readthedocs.build/en/24343/

@keyboardAnt

keyboardAnt commented Nov 12, 2025

@eitanturok -

Forward ratios:
What fraction of the vocab remains after pruning here? (Why does the forward ratio of frspec equal the forward ratio of eagle in this benchmark?)

Complete benchmark:
frspec seems to increase throughput by 1.52% (=100*(121.99-120.16)/120.16) in the benchmark above. Are there any blockers to running a sweep over different configurations? For example, (eagle_model, eagle_params, dataset, batch_size, hardware) combinations, where eagle_params defines the draft-tree shape (max_depth, max_width, max_num_of_nodes).

My intuition is that (i) large batches evaluated on datasets with (ii) long inputs that induce (iii) long outputs (e.g., GovReport, BookSum) are more likely to demonstrate significant improvements, based on this microbenchmark: #24506 (comment).
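A minimal sketch of such a sweep, for illustration only, using just the flags that already appear in the commands above (the grid, dataset, and output handling are placeholders):

```python
import itertools
import os
import subprocess

# Placeholder sweep over batch sizes and draft-tree shapes for the spec_decode.py
# example; num-spec-tokens is set to depth * branching, mirroring the rows in the
# table above. Extend the grids (datasets, eagle checkpoints, hardware) as needed.
BASE = [
    "python3", "examples/offline_inference/spec_decode.py",
    "--dataset-name", "hf", "--dataset-path", "philschmid/mt-bench",
    "--num-prompts", "100", "--compilation-config", '{"level": "0"}',
]

for max_num_seqs, (depth, branching) in itertools.product(
    [1, 8, 32],                # batch sizes
    [(1, 1), (1, 3), (3, 1)],  # draft-tree shapes (depth, branching)
):
    num_spec_tokens = depth * branching
    cmd = BASE + [
        "--max-num-seqs", str(max_num_seqs),
        "--num-spec-tokens", str(num_spec_tokens),
        "--spec-token-tree-depth", str(depth),
        "--spec-token-tree-branching", str(branching),
    ]
    subprocess.run(cmd, env={**os.environ, "VLLM_USE_V1": "1"}, check=True)
```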

