Conversation

@aurickq
Contributor

@aurickq aurickq commented Sep 26, 2025

Purpose

This PR adds Suffix Decoding (https://arxiv.org/abs/2411.04975) as a new speculative decoding method in vLLM. Suffix Decoding is a dynamic n-gram matching method that:

  1. Uses suffix trees and branch frequency counts to generate speculative tokens quickly.
  2. Can keep a history of prior model responses, which tends to work very well with repetitive agentic use cases.
  3. Can be dynamically updated with newly generated tokens, with FIFO eviction of older requests.
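
For intuition, here is a minimal, hedged sketch of the idea behind these three points (not the arctic-inference implementation, whose API and data structures differ): index prior tokens in a tree of suffixes with branch frequency counts, then speculate by following the most frequent continuation of the longest suffix that matches the current context. All names here are illustrative.

```python
from collections import defaultdict


class ToySuffixTree:
    """Illustrative only: a dict-of-dicts "tree" keyed by token paths."""

    def __init__(self, max_depth: int = 64):
        self.max_depth = max_depth
        # counts[path][tok] = how often `tok` followed the token path `path`.
        self.counts: dict[tuple[int, ...], dict[int, int]] = defaultdict(
            lambda: defaultdict(int))

    def add(self, tokens: list[int]) -> None:
        # Index every suffix (bounded by max_depth); a real implementation
        # builds this incrementally and evicts old requests FIFO.
        for start in range(len(tokens)):
            path: tuple[int, ...] = ()
            for tok in tokens[start:start + self.max_depth]:
                self.counts[path][tok] += 1
                path += (tok,)

    def speculate(self, context: list[int], max_spec: int) -> list[int]:
        # Find the longest suffix of the context that appears in the tree.
        for n in range(min(len(context), self.max_depth), 0, -1):
            path = tuple(context[-n:])
            if path in self.counts:
                break
        else:
            return []
        # Greedily follow the most frequent branch; stop early when there is
        # no continuation (real scoring also stops on low-confidence branches).
        draft: list[int] = []
        while len(draft) < max_spec and self.counts.get(path):
            tok = max(self.counts[path], key=self.counts[path].get)
            draft.append(tok)
            path += (tok,)
        return draft


tree = ToySuffixTree()
tree.add([1, 2, 3, 4, 2, 3, 4, 5])
print(tree.speculate([1, 2, 3], max_spec=4))  # -> [4, 2, 3, 4]
```

This toy version re-indexes whole sequences and is quadratic; the paper's suffix-tree construction and scoring are what make the real method fast and accurate.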

Test Plan

  • Benchmark Suffix Decoding against the current ngram speculator.
  • Write and run unit tests
  • Documentation

Test Result

Benchmarks on Specbench and Blazedit (on an H200) are below. Suffix Decoding beats ngram in nearly all cases. In practice, we have seen larger speedups for real user interactions and agentic requests, since they tend to exhibit more output repetition than these benchmark datasets.

Script for benchmark reproduction: benchmark.sh

Specbench

Time per output token (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 4.4 | 4.64 | 5.85 | 10.55 |
| suffix (w/ cache) | 12 | 4.39 | 4.63 | 5.85 | 10.66 |
| suffix (w/ cache) | 32 | 4.39 | 4.63 | 5.82 | 10.67 |
| suffix (w/o cache) | 5 | 4.74 | 5.06 | 6.16 | 10.67 |
| suffix (w/o cache) | 12 | 4.73 | 5.02 | 6.15 | 10.76 |
| suffix (w/o cache) | 32 | 4.76 | 5.05 | 6.2 | 10.73 |
| ngram [5, 5] | 5 | 5.6 | 5.84 | 6.94 | 11.07 |
| ngram [5, 5] | 12 | 5.58 | 5.8 | 6.89 | 11.19 |
| ngram [5, 5] | 32 | 5.59 | 5.82 | 7.04 | 11.83 |
| ngram [3, 5] | 5 | 5.21 | 5.5 | 6.61 | 10.66 |
| ngram [3, 5] | 12 | 5.16 | 5.44 | 6.59 | 11.15 |
| ngram [3, 5] | 32 | 5.18 | 5.52 | 6.87 | 13.37 |

Total drafted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 68790 | 69238 | 68795 | 68452 |
| suffix (w/ cache) | 12 | 71154 | 71655 | 70952 | 71446 |
| suffix (w/ cache) | 32 | 71154 | 71378 | 71531 | 71283 |
| suffix (w/o cache) | 5 | 48012 | 48139 | 48081 | 48164 |
| suffix (w/o cache) | 12 | 50043 | 50258 | 50326 | 50282 |
| suffix (w/o cache) | 32 | 50043 | 49761 | 50466 | 49928 |
| ngram [5, 5] | 5 | 12460 | 12615 | 12610 | 12590 |
| ngram [5, 5] | 12 | 26268 | 26307 | 26673 | 26629 |
| ngram [5, 5] | 32 | 65293 | 65338 | 64615 | 64327 |
| ngram [3, 5] | 5 | 31606 | 31826 | 31608 | 31460 |
| ngram [3, 5] | 12 | 69535 | 69035 | 68498 | 68005 |
| ngram [3, 5] | 32 | 172779 | 169136 | 169809 | 170677 |

Total accepted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 18537 | 18727 | 18461 | 18437 |
| suffix (w/ cache) | 12 | 18609 | 18781 | 18614 | 18780 |
| suffix (w/ cache) | 32 | 18609 | 18751 | 18852 | 18654 |
| suffix (w/o cache) | 5 | 15401 | 15486 | 15377 | 15534 |
| suffix (w/o cache) | 12 | 15442 | 15637 | 15628 | 15558 |
| suffix (w/o cache) | 32 | 15442 | 15361 | 15669 | 15568 |
| ngram [5, 5] | 5 | 4757 | 4812 | 4794 | 4741 |
| ngram [5, 5] | 12 | 5046 | 5208 | 5046 | 5179 |
| ngram [5, 5] | 32 | 5149 | 5219 | 5203 | 5109 |
| ngram [3, 5] | 5 | 9278 | 9260 | 9288 | 9242 |
| ngram [3, 5] | 12 | 9857 | 9678 | 9722 | 9782 |
| ngram [3, 5] | 32 | 10040 | 9856 | 10011 | 9975 |

Blazedit

Time per output token (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 2.13 | 2.44 | 3.27 | 5.96 |
| suffix (w/ cache) | 12 | 1.77 | 2.04 | 2.84 | 5.77 |
| suffix (w/ cache) | 32 | 1.82 | 2.01 | 2.88 | 5.63 |
| suffix (w/o cache) | 5 | 2.22 | 2.44 | 3.31 | 5.99 |
| suffix (w/o cache) | 12 | 1.89 | 2.09 | 2.88 | 5.62 |
| suffix (w/o cache) | 32 | 1.91 | 2.11 | 2.85 | 5.63 |
| ngram [5, 5] | 5 | 2.75 | 3.05 | 3.99 | 6.66 |
| ngram [5, 5] | 12 | 2.41 | 2.68 | 3.51 | 6.23 |
| ngram [5, 5] | 32 | 2.23 | 2.51 | 3.55 | 7.46 |
| ngram [3, 5] | 5 | 2.44 | 2.69 | 3.57 | 6.18 |
| ngram [3, 5] | 12 | 2.05 | 2.31 | 3.11 | 6.03 |
| ngram [3, 5] | 32 | 1.86 | 2.22 | 3.33 | 8.13 |

Total drafted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 161067 | 164646 | 163410 | 171591 |
| suffix (w/ cache) | 12 | 188892 | 185344 | 186202 | 179407 |
| suffix (w/ cache) | 32 | 188892 | 181810 | 185943 | 184837 |
| suffix (w/o cache) | 5 | 149045 | 152582 | 153911 | 153363 |
| suffix (w/o cache) | 12 | 173522 | 174035 | 178302 | 171757 |
| suffix (w/o cache) | 32 | 173522 | 167821 | 178697 | 171921 |
| ngram [5, 5] | 5 | 122885 | 124817 | 123925 | 116898 |
| ngram [5, 5] | 12 | 164000 | 168710 | 177866 | 169000 |
| ngram [5, 5] | 32 | 305025 | 303489 | 303603 | 316235 |
| ngram [3, 5] | 5 | 146892 | 146052 | 152542 | 143307 |
| ngram [3, 5] | 12 | 223238 | 231225 | 228872 | 231770 |
| ngram [3, 5] | 32 | 432295 | 434561 | 456818 | 433020 |

Total accepted tokens

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 104448 | 107678 | 105853 | 112648 |
| suffix (w/ cache) | 12 | 119902 | 114103 | 116161 | 109788 |
| suffix (w/ cache) | 32 | 119902 | 113189 | 114991 | 115209 |
| suffix (w/o cache) | 5 | 101846 | 105089 | 106614 | 104780 |
| suffix (w/o cache) | 12 | 114345 | 114439 | 117194 | 112100 |
| suffix (w/o cache) | 32 | 114345 | 109405 | 117543 | 110273 |
| ngram [5, 5] | 5 | 89233 | 91067 | 90410 | 85974 |
| ngram [5, 5] | 12 | 94002 | 95939 | 101547 | 97922 |
| ngram [5, 5] | 32 | 102083 | 103021 | 104049 | 106095 |
| ngram [3, 5] | 5 | 95830 | 96248 | 98966 | 94171 |
| ngram [3, 5] | 12 | 103658 | 106170 | 106975 | 110182 |
| ngram [3, 5] | 32 | 110953 | 110166 | 113630 | 111404 |

Older Results (before optimizing)

refactor-bench (out=1024)

Results are mean TPOT (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 2.15 | 3.68 | 9.02 | 26.64 |
| suffix (w/ cache) | 12 | 1.91 | 3.36 | 8.56 | 26.32 |
| suffix (w/ cache) | 32 | 1.81 | 3.22 | 8.58 | 26.78 |
| suffix (w/o cache) | 5 | 2.35 | 3.92 | 9.2 | 26.78 |
| suffix (w/o cache) | 12 | 2.13 | 3.65 | 8.92 | 26.68 |
| suffix (w/o cache) | 32 | 2.04 | 3.56 | 8.98 | 27.77 |
| ngram | 5 | 2.99 | 4.7 | 10.41 | 28.62 |
| ngram | 12 | 2.68 | 4.41 | 9.85 | 28.66 |
| ngram | 32 | 2.58 | 4.32 | 10.57 | 32.63 |

spec-bench (out=256)

Results are mean TPOT (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 4.27 | 4.67 | 6.17 | 12.03 |
| suffix (w/ cache) | 12 | 4.26 | 4.71 | 6.2 | 12.11 |
| suffix (w/ cache) | 32 | 4.28 | 4.73 | 6.17 | 12.27 |
| suffix (w/o cache) | 5 | 4.63 | 5.09 | 6.38 | 11.68 |
| suffix (w/o cache) | 12 | 4.63 | 5.1 | 6.37 | 11.62 |
| suffix (w/o cache) | 32 | 4.62 | 5.06 | 6.35 | 11.66 |
| ngram | 5 | 5.38 | 5.7 | 6.77 | 10.98 |
| ngram | 12 | 5.37 | 5.67 | 6.76 | 10.99 |
| ngram | 32 | 5.37 | 5.73 | 6.87 | 11.76 |

@mergify

mergify bot commented Sep 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aurickq.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request integrates Suffix Decoding from Arctic Inference as a new speculative decoding method. The changes are well-structured, adding new configuration options, validation, and the core logic for proposing draft tokens and managing the suffix cache. My review identifies a potential type inconsistency in the token sequences passed to the arctic-inference library, which could lead to runtime errors. I've suggested a fix to ensure consistency.

@simon-mo
Collaborator

@codex review

@simon-mo
Collaborator

note to reviewers:

  • We discussed with the Snowflake team that importing from arctic-inference is an acceptable path forward, and the team is committed to maintaining it as a separate library.
  • Please focus on code quality, interfaces, UX, etc.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@keyboardAnt

keyboardAnt commented Sep 26, 2025

@aurickq, thanks for your awesome contribution, the results look good!

Suffix decoding outperforms n-gram at out=1024, but falls behind at out=256 with concurrency=64 (+5.8% in the best case). Any idea why?

@aurickq
Contributor Author

aurickq commented Sep 28, 2025

> @aurickq, thanks for your awesome contribution, the results look good!
>
> Suffix decoding outperforms n-gram at out=1024, but falls behind at out=256 with concurrency=64 (+5.8% in the best case). Any idea why?

The out=1024 and out=256 runs also use two different datasets, so they might not be very comparable. Other than that, when concurrency is high and the number of output tokens is low (e.g. 256), request completion time becomes dominated by mixed-prefill batches that drag up the mean TPOT metric, so it makes sense that the performance of suffix and ngram approaches each other in these cases.

As for why suffix becomes a little worse than ngram for spec_bench out=256 and concurrency=64, here is my guess: the SpecBench dataset is more open-ended (higher entropy, less repetition) than refactor-benchmark, so we would already expect suffix/ngram to perform worse on it. The benchmark is also small (400-500 examples), so suffix decoding might not have built a sufficiently large cache to accurately predict the next tokens. Indeed, in the benchmarks above, suffix decoding actually performs better in this setting when the cache is disabled.

I have some ideas for solving this latter issue when the cached data is sparse, which I might later implement and contribute as a "suffix v2" method, if it works.

@Neo9061

Neo9061 commented Sep 29, 2025

Thanks a lot for the contribution, @aurickq! A few questions.

  1. In your benchmarking, when the cache is enabled, does that refer to the global tree? What training data are you using to construct the global tree?
  2. Can we enable an option to make the global tree static, built from some offline training data? As explained in the other thread, this would be very useful for multi-tenant requests. Plan to merge Suffix decoding into vLLM mainline? snowflakedb/ArcticInference#171 (comment)
  3. Can your PR work with the hybrid PR [Spec Decode][Hybrid] Add ngram-eagle SD method #24344, which enables n-gram and EAGLE, so that we can hybridize suffix decoding and EAGLE?
  4. For the comparison between suffix decoding w/o cache and n-gram, what do you think is the reason that suffix decoding w/o cache works better than n-gram? In my understanding, they are almost equivalent when suffix decoding does not use the global cache. One reason I can think of is the dynamic drafting length that suffix decoding has over n-gram.

@aurickq
Contributor Author

aurickq commented Sep 29, 2025

@Neo9061

  1. "w/ cache" means using the global suffix tree, and "w/o cache" means not using the global suffix tree (setting suffix_decoding_max_cached_requests = 0. The per-prompt suffix trees are used in both cases. In these benchmarks, the only requests being cached are the earlier requests in the same benchmark. The performance would probably be much better in a more realistic setting when more requests can be cached over a longer period of time.
  2. I think this is a good idea, but I would like to address this in a follow-up PR once the core suffix speculation is enabled. It could use more input from the community on interface design, like what's the best format to read the "static" cache.
  3. The current PR doesn't consider hybrid speculation yet, would also be good to add in the future.
  4. Yeah they are "almost" equivalent except for suffix decoding's frequency stats and scoring mechanism. For each speculation length, suffix decoding can speculate up to that many tokens but can also speculate less if there is no probable continuation to save on verification costs. It also means that out of several possible continuations, suffix decoding can choose the most "frequent" one to maximize the probability of acceptance.
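
For illustration, here is a hedged sketch of how the cache toggle discussed in point 1 might be exercised from the offline LLM API. The exact speculative_config keys (in particular the "suffix" method name and where suffix_decoding_max_cached_requests is set) are assumptions based on this thread, not a confirmed interface; the model name and values are placeholders.

```python
from vllm import LLM, SamplingParams

# Assumed configuration keys for suffix decoding (see caveat above).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    speculative_config={
        "method": "suffix",            # assumed method name
        "num_speculative_tokens": 12,  # max speculation length (spec_len)
        # 0 would disable the global cross-request suffix tree ("w/o cache");
        # a positive value bounds how many past requests are retained before
        # FIFO eviction ("w/ cache"). The value below is illustrative.
        "suffix_decoding_max_cached_requests": 1000,
    },
)

outputs = llm.generate(["Refactor this function..."],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```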

@mergify mergify bot added the ci/build label Sep 29, 2025
Comment on lines +42 to +60
draft_token_ids: list[list[int]] = []
for i, sampled_ids in enumerate(sampled_token_ids):
    if not sampled_ids:
        # Skip speculative decoding for partial prefills.
        draft_token_ids.append([])
        continue

    # Skip requests that require sampling parameters that are not
    # supported with speculative decoding.
    req_id = input_batch.req_ids[i]
    if req_id in input_batch.spec_decode_unsupported_reqs:
        draft_token_ids.append([])
        continue

    num_tokens = input_batch.num_tokens_no_spec[i]
    if num_tokens >= self.max_model_len:
        # Skip requests that have already reached the max model length.
        draft_token_ids.append([])
        continue
Collaborator


Hmm, I might have forgotten to flush one of my previous comments. Seems there's quite some code here duplicated from NgramProposer. I'm wondering if we should come up with some ModelFreeProposer class and put the common logic there.

Ideally, that would make future extensions easier as well.
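
For concreteness, a hypothetical sketch of the ModelFreeProposer idea: a small base class that owns the shared skip conditions, leaving only the drafting strategy to subclasses. Names and signatures are illustrative, not vLLM's actual interfaces.

```python
from abc import ABC, abstractmethod


class ModelFreeProposer(ABC):
    """Hypothetical base class for proposers that draft without a model."""

    def __init__(self, max_model_len: int):
        self.max_model_len = max_model_len

    def propose(self, input_batch, sampled_token_ids) -> list[list[int]]:
        draft_token_ids: list[list[int]] = []
        for i, sampled_ids in enumerate(sampled_token_ids):
            if self._should_skip(input_batch, i, sampled_ids):
                draft_token_ids.append([])
                continue
            draft_token_ids.append(self._propose_one(input_batch, i))
        return draft_token_ids

    def _should_skip(self, input_batch, i, sampled_ids) -> bool:
        # The three shared conditions from the snippet above: partial
        # prefills, unsupported sampling params, and max-model-len reached.
        if not sampled_ids:
            return True
        if input_batch.req_ids[i] in input_batch.spec_decode_unsupported_reqs:
            return True
        return input_batch.num_tokens_no_spec[i] >= self.max_model_len

    @abstractmethod
    def _propose_one(self, input_batch, i) -> list[int]:
        """N-gram matching or suffix-tree lookup would live here."""
```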

Contributor Author


It seems that currently it's just the three continue statements that overlap, which is a pretty small part of the NgramProposer.

Collaborator

@Jialin Jialin left a comment


LGTM, as the logic is showing promising results and is quite self-contained, so it doesn't affect the majority of use cases.

CC @houseroad for the potential final review

@mergify

mergify bot commented Oct 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aurickq.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 22, 2025
@mergify mergify bot removed the needs-rebase label Oct 24, 2025
@aurickq
Contributor Author

aurickq commented Oct 24, 2025

Rebased. Could someone help trigger CI?

@Jialin
Collaborator

Jialin commented Oct 25, 2025

> Rebased. Could someone help trigger CI?

@aurickq Could you try to address the DCO and doc build failure first?

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 25, 2025
@aurickq
Contributor Author

aurickq commented Oct 27, 2025

> @aurickq Could you try to address the DCO and doc build failure first?

Fixed the doc failure. For DCO, in the past I've avoided addressing this since it leaks my personal email publicly :) (not sure if this part has changed)

@simon-mo simon-mo merged commit 2c19d96 into vllm-project:main Nov 3, 2025
88 of 91 checks passed
zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request Nov 4, 2025
@ggg-s

ggg-s commented Nov 6, 2025

@aurickq Why would you use the --no-enable-prefix-caching parameter?

juliendenize pushed a commit to juliendenize/vllm that referenced this pull request Nov 6, 2025
Labels

ci/build, documentation, ready, speculative-decoding, v1
