
[Perf] Async Scheduling + Speculative Decoding + Structured Outputs#29821

Merged
njhill merged 17 commits into vllm-project:main from CentML:async-spec-struct
Jan 6, 2026

Conversation

@benchislett (Collaborator) commented Dec 1, 2025

Purpose

This PR enables structured outputs when speculative decoding and async scheduling are used together. The approach is as follows:

  • After computing the draft tokens, asynchronously copy them to the CPU on a separate stream, overlapping the copy with the next model forward step.
  • Concurrently, after calling the next execute_model, the scheduler calls take_draft_token_ids, which waits for the draft tokens to arrive on the CPU and yields them.
  • The scheduler uses the draft tokens to build the grammar bitmask for the batch (including the spec tokens, still concurrent with execute_model), then calls sample_tokens as before.
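The handoff described above can be sketched in miniature. This is not the PR's actual code: the class name `DraftTokenMailbox` is hypothetical, and a `threading.Event` stands in for the CUDA copy-done event that the real implementation would record on the side stream; only the method name `take_draft_token_ids` comes from the PR description.

```python
# Hypothetical sketch of the draft-token handoff. `DraftTokenMailbox` is an
# invented name; threading.Event models the CUDA event signaled when the
# async device-to-host copy of the draft tokens completes.
import threading


class DraftTokenMailbox:
    def __init__(self):
        self._ready = threading.Event()
        self._draft_token_ids = None

    def publish(self, draft_token_ids):
        # Worker side: called once the D2H copy (issued on a separate
        # stream, overlapped with the next forward step) has landed.
        self._draft_token_ids = draft_token_ids
        self._ready.set()

    def take_draft_token_ids(self):
        # Scheduler side: called after launching the next execute_model.
        # Blocks only until the copy finishes, then yields the tokens.
        self._ready.wait()
        tokens, self._draft_token_ids = self._draft_token_ids, None
        self._ready.clear()
        return tokens
```

Because the scheduler only blocks inside take_draft_token_ids, the copy latency hides behind the already-launched forward pass.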

To accommodate invalid draft tokens (those that do not adhere to the schema), we replace them on the scheduler_output object and pad the rejected tokens with -1, which is ignored when filling the bitmasks. To ensure that the model does not sample from a position after an invalid token, we attach a per-request count of invalid tokens to the grammar output and use it to mask out the sampled positions.
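A minimal sketch of the padding logic, under the assumption (consistent with speculative decoding) that every draft position after the first grammar-invalid token must also be discarded, since it was proposed conditioned on the bad token. The helper name `mask_invalid_drafts` is invented for illustration.

```python
# Hypothetical helper: pad grammar-invalid draft tokens with -1 and count
# invalid positions per request. -1 entries are skipped when filling the
# bitmask; the counts are used to mask out sampled positions downstream.
def mask_invalid_drafts(draft_token_ids, is_valid):
    padded, num_invalid = [], []
    for tokens, valid in zip(draft_token_ids, is_valid):
        keep = len(tokens)
        for i, ok in enumerate(valid):
            if not ok:
                # Everything from the first invalid draft onward is rejected.
                keep = i
                break
        padded.append(tokens[:keep] + [-1] * (len(tokens) - keep))
        num_invalid.append(len(tokens) - keep)
    return padded, num_invalid
```

For example, a request whose second of three drafts violates the grammar keeps only the first draft and records two invalid positions.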

Test Plan

GSM8K to check for base-model correctness regressions.

The structured-outputs xgrammar_bench benchmark for speed, plus json_unique for grammar adherence.

Test Result

Passes correctness and structured output adherence tests.

Benchmarking details:

Setup:

vllm serve meta-llama/Llama-3.1-8B-Instruct --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}' --async-scheduling --max-num-seqs 128 --no-enable-prefix-caching

Three runs: one with async + spec, one with only spec, one with only async.

Result:

100% coverage of xgrammar_bench in all cases.
Perf measurement (when controlling for output tokens with ignore-eos):

With Structured Outputs

  • Async Scheduling (1.00x E2E)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  4.71      
Total input tokens:                      30919     
Total generated tokens:                  9569      
Request throughput (req/s):              21.25     
Output token throughput (tok/s):         2033.52   
Total Token throughput (tok/s):          8604.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          46.46     
Median TTFT (ms):                        49.31     
P99 TTFT (ms):                           70.04     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.44      
Median TPOT (ms):                        4.42      
P99 TPOT (ms):                           4.73      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.25      
Median ITL (ms):                         4.08      
P99 ITL (ms):                            10.61     
==================================================
  • Spec Decoding (1.38x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  3.40      
Total input tokens:                      30919     
Total generated tokens:                  9574      
Request throughput (req/s):              29.41     
Output token throughput (tok/s):         2815.90   
Total Token throughput (tok/s):          11909.78  
---------------Time to First Token----------------
Mean TTFT (ms):                          27.96     
Median TTFT (ms):                        23.60     
P99 TTFT (ms):                           71.90     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.11      
Median TPOT (ms):                        3.14      
P99 TPOT (ms):                           4.10      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.97      
Median ITL (ms):                         6.35      
P99 ITL (ms):                            13.68     
==================================================
  • Async Scheduling + Spec Decoding (1.62x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  2.91      
Total input tokens:                      30919     
Total generated tokens:                  9570      
Request throughput (req/s):              34.32     
Output token throughput (tok/s):         3284.67   
Total Token throughput (tok/s):          13896.85  
---------------Time to First Token----------------
Mean TTFT (ms):                          35.58     
Median TTFT (ms):                        31.50     
P99 TTFT (ms):                           71.67     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.56      
Median TPOT (ms):                        2.60      
P99 TPOT (ms):                           3.52      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.76      
Median ITL (ms):                         5.09      
P99 ITL (ms):                            14.75     
==================================================

No Structured Outputs

  • Async Scheduling (1.00x E2E)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  4.55      
Total input tokens:                      30919     
Total generated tokens:                  9645      
Request throughput (req/s):              21.99     
Output token throughput (tok/s):         2121.08   
Total Token throughput (tok/s):          8920.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          39.46     
Median TTFT (ms):                        41.62     
P99 TTFT (ms):                           56.57     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.34      
Median TPOT (ms):                        4.33      
P99 TPOT (ms):                           4.69      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.18      
Median ITL (ms):                         4.08      
P99 ITL (ms):                            5.13      
==================================================
  • Spec Decoding (1.56x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  2.91      
Total input tokens:                      30919     
Total generated tokens:                  9661      
Request throughput (req/s):              34.42     
Output token throughput (tok/s):         3324.92   
Total Token throughput (tok/s):          13965.99  
---------------Time to First Token----------------
Mean TTFT (ms):                          24.56     
Median TTFT (ms):                        21.39     
P99 TTFT (ms):                           51.44     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.66      
Median TPOT (ms):                        2.69      
P99 TPOT (ms):                           3.68      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.21      
Median ITL (ms):                         5.52      
P99 ITL (ms):                            13.48     
==================================================
  • Async Scheduling + Spec Decoding (1.66x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  2.74      
Total input tokens:                      30919     
Total generated tokens:                  9661      
Request throughput (req/s):              36.45     
Output token throughput (tok/s):         3521.31   
Total Token throughput (tok/s):          14790.89  
---------------Time to First Token----------------
Mean TTFT (ms):                          30.94     
Median TTFT (ms):                        29.38     
P99 TTFT (ms):                           53.58     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.43      
Median TPOT (ms):                        2.44      
P99 TPOT (ms):                           3.31      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.68      
Median ITL (ms):                         5.03      
P99 ITL (ms):                            15.35     
==================================================
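The headline multipliers above are roughly the ratios of the end-to-end benchmark durations (the baseline being the async-scheduling-only run in each group). A quick check, using only the durations reported in the result blocks:

```python
# Sanity-check of the reported E2E speedups from the benchmark durations.
def e2e_speedup(baseline_s, run_s):
    return baseline_s / run_s


# With structured outputs (baseline 4.71 s):
structured = {
    "spec": e2e_speedup(4.71, 3.40),        # ~1.38-1.39x
    "async+spec": e2e_speedup(4.71, 2.91),  # ~1.62x
}
# Without structured outputs (baseline 4.55 s):
unstructured = {
    "spec": e2e_speedup(4.55, 2.91),        # ~1.56x
    "async+spec": e2e_speedup(4.55, 2.74),  # ~1.66x
}
```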

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot added the tpu Related to Google TPUs label Dec 1, 2025
mergify bot commented Dec 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 1, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the needs-rebase label Dec 2, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the tpu Related to Google TPUs label Dec 2, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 6, 2026
@njhill (Member) left a comment:
Thanks @benchislett for the teamwork!

@njhill njhill enabled auto-merge (squash) January 6, 2026 18:23
@njhill njhill merged commit f7008ce into vllm-project:main Jan 6, 2026
48 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…llm-project#29821)

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 13, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. remove `init_cached_hf_modules` due to
vllm-project/vllm#31786
2. fix spec_decode e2e test broken by
vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to
vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to
vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), structured-output, v1

Projects: Status: Done

2 participants