
[Perf] Async Scheduling + Speculative Decoding + Structured Outputs#29821

Merged
njhill merged 17 commits into vllm-project:main from CentML:async-spec-struct
Jan 6, 2026

Conversation

@benchislett (Collaborator) commented Dec 1, 2025

Purpose

This PR enables structured outputs when speculative decoding and async scheduling are used together. The approach is as follows:

  • After computing the draft tokens, asynchronously copy them to the CPU on a separate stream, overlapping the copy with the next model forward step.
  • Concurrently, after calling the next execute_model, the scheduler calls take_draft_token_ids, which waits for the draft tokens to arrive on the CPU and yields them.
  • The scheduler uses the draft tokens to build the grammar bitmask for the batch (including the spec tokens, still concurrent with execute_model), then calls sample_tokens as before.
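The handoff described above can be sketched in miniature. This is not the PR's actual code: the class name `DraftTokenMailbox` is hypothetical, and a `threading.Event` stands in for the CUDA copy-done event that the real implementation would record on the side stream; only the method name `take_draft_token_ids` comes from the PR description.

```python
# Hypothetical sketch of the draft-token handoff. `DraftTokenMailbox` is an
# invented name; threading.Event models the CUDA event signaled when the
# async device-to-host copy of the draft tokens completes.
import threading


class DraftTokenMailbox:
    def __init__(self):
        self._ready = threading.Event()
        self._draft_token_ids = None

    def publish(self, draft_token_ids):
        # Worker side: called once the D2H copy (issued on a separate
        # stream, overlapped with the next forward step) has landed.
        self._draft_token_ids = draft_token_ids
        self._ready.set()

    def take_draft_token_ids(self):
        # Scheduler side: called after launching the next execute_model.
        # Blocks only until the copy finishes, then yields the tokens.
        self._ready.wait()
        tokens, self._draft_token_ids = self._draft_token_ids, None
        self._ready.clear()
        return tokens
```

Because the scheduler only blocks inside take_draft_token_ids, the copy latency hides behind the already-launched forward pass.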

To accommodate invalid draft tokens (those that do not adhere to the schema), we replace them on the scheduler_output object and pad the rejected tokens with -1, which is ignored when filling the bitmasks. To ensure that the model does not sample from a position after an invalid token, we attach a per-request count of invalid tokens to the grammar output and use it to mask out the sampled positions.
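A minimal sketch of the padding logic, under the assumption (consistent with speculative decoding) that every draft position after the first grammar-invalid token must also be discarded, since it was proposed conditioned on the bad token. The helper name `mask_invalid_drafts` is invented for illustration.

```python
# Hypothetical helper: pad grammar-invalid draft tokens with -1 and count
# invalid positions per request. -1 entries are skipped when filling the
# bitmask; the counts are used to mask out sampled positions downstream.
def mask_invalid_drafts(draft_token_ids, is_valid):
    padded, num_invalid = [], []
    for tokens, valid in zip(draft_token_ids, is_valid):
        keep = len(tokens)
        for i, ok in enumerate(valid):
            if not ok:
                # Everything from the first invalid draft onward is rejected.
                keep = i
                break
        padded.append(tokens[:keep] + [-1] * (len(tokens) - keep))
        num_invalid.append(len(tokens) - keep)
    return padded, num_invalid
```

For example, a request whose second of three drafts violates the grammar keeps only the first draft and records two invalid positions.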

Test Plan

GSM8K to check for base-model correctness regressions.

The structured-outputs xgrammar_bench benchmark for speed, plus json_unique for grammar adherence.

Test Result

Passes correctness and structured output adherence tests.

Benchmarking details:

Setup:

vllm serve meta-llama/Llama-3.1-8B-Instruct --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}' --async-scheduling --max-num-seqs 128 --no-enable-prefix-caching

Three runs: one with async + spec, one with only spec, one with only async.

Result:

100% coverage of xgrammar_bench in all cases.
Perf measurement (when controlling for output tokens with ignore-eos):

With Structured Outputs

  • Async Scheduling (1.00x E2E)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  4.71      
Total input tokens:                      30919     
Total generated tokens:                  9569      
Request throughput (req/s):              21.25     
Output token throughput (tok/s):         2033.52   
Total Token throughput (tok/s):          8604.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          46.46     
Median TTFT (ms):                        49.31     
P99 TTFT (ms):                           70.04     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.44      
Median TPOT (ms):                        4.42      
P99 TPOT (ms):                           4.73      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.25      
Median ITL (ms):                         4.08      
P99 ITL (ms):                            10.61     
==================================================
  • Spec Decoding (1.38x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  3.40      
Total input tokens:                      30919     
Total generated tokens:                  9574      
Request throughput (req/s):              29.41     
Output token throughput (tok/s):         2815.90   
Total Token throughput (tok/s):          11909.78  
---------------Time to First Token----------------
Mean TTFT (ms):                          27.96     
Median TTFT (ms):                        23.60     
P99 TTFT (ms):                           71.90     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          3.11      
Median TPOT (ms):                        3.14      
P99 TPOT (ms):                           4.10      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.97      
Median ITL (ms):                         6.35      
P99 ITL (ms):                            13.68     
==================================================
  • Async Scheduling + Spec Decoding (1.62x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  2.91      
Total input tokens:                      30919     
Total generated tokens:                  9570      
Request throughput (req/s):              34.32     
Output token throughput (tok/s):         3284.67   
Total Token throughput (tok/s):          13896.85  
---------------Time to First Token----------------
Mean TTFT (ms):                          35.58     
Median TTFT (ms):                        31.50     
P99 TTFT (ms):                           71.67     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.56      
Median TPOT (ms):                        2.60      
P99 TPOT (ms):                           3.52      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.76      
Median ITL (ms):                         5.09      
P99 ITL (ms):                            14.75     
==================================================

No Structured Outputs

  • Async Scheduling (1.00x E2E)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  4.55      
Total input tokens:                      30919     
Total generated tokens:                  9645      
Request throughput (req/s):              21.99     
Output token throughput (tok/s):         2121.08   
Total Token throughput (tok/s):          8920.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          39.46     
Median TTFT (ms):                        41.62     
P99 TTFT (ms):                           56.57     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          4.34      
Median TPOT (ms):                        4.33      
P99 TPOT (ms):                           4.69      
---------------Inter-token Latency----------------
Mean ITL (ms):                           4.18      
Median ITL (ms):                         4.08      
P99 ITL (ms):                            5.13      
==================================================
  • Spec Decoding (1.56x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  2.91      
Total input tokens:                      30919     
Total generated tokens:                  9661      
Request throughput (req/s):              34.42     
Output token throughput (tok/s):         3324.92   
Total Token throughput (tok/s):          13965.99  
---------------Time to First Token----------------
Mean TTFT (ms):                          24.56     
Median TTFT (ms):                        21.39     
P99 TTFT (ms):                           51.44     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.66      
Median TPOT (ms):                        2.69      
P99 TPOT (ms):                           3.68      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.21      
Median ITL (ms):                         5.52      
P99 ITL (ms):                            13.48     
==================================================
  • Async Scheduling + Spec Decoding (1.66x)
============ Serving Benchmark Result ============
Successful requests:                     100       
Maximum request concurrency:             10        
Benchmark duration (s):                  2.74      
Total input tokens:                      30919     
Total generated tokens:                  9661      
Request throughput (req/s):              36.45     
Output token throughput (tok/s):         3521.31   
Total Token throughput (tok/s):          14790.89  
---------------Time to First Token----------------
Mean TTFT (ms):                          30.94     
Median TTFT (ms):                        29.38     
P99 TTFT (ms):                           53.58     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.43      
Median TPOT (ms):                        2.44      
P99 TPOT (ms):                           3.31      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.68      
Median ITL (ms):                         5.03      
P99 ITL (ms):                            15.35     
==================================================
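The headline multipliers above are roughly the ratios of the end-to-end benchmark durations (the baseline being the async-scheduling-only run in each group). A quick check, using only the durations reported in the result blocks:

```python
# Sanity-check of the reported E2E speedups from the benchmark durations.
def e2e_speedup(baseline_s, run_s):
    return baseline_s / run_s


# With structured outputs (baseline 4.71 s):
structured = {
    "spec": e2e_speedup(4.71, 3.40),        # ~1.38-1.39x
    "async+spec": e2e_speedup(4.71, 2.91),  # ~1.62x
}
# Without structured outputs (baseline 4.55 s):
unstructured = {
    "spec": e2e_speedup(4.55, 2.91),        # ~1.56x
    "async+spec": e2e_speedup(4.55, 2.74),  # ~1.66x
}
```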

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot added the tpu Related to Google TPUs label Dec 1, 2025
mergify bot commented Dec 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 1, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the needs-rebase label Dec 2, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the tpu Related to Google TPUs label Dec 2, 2025
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 6, 2026
@njhill (Member) left a comment:
Thanks @benchislett for the teamwork!

@njhill njhill enabled auto-merge (squash) January 6, 2026 18:23
@njhill njhill merged commit f7008ce into vllm-project:main Jan 6, 2026
48 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…llm-project#29821)

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 13, 2026
### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)

1. remove `init_cached_hf_modules` due to
vllm-project/vllm#31786
2. fix spec_decode e2e test broken by
vllm-project/vllm#29821
3. fix `vllm.v1.attention.backends.utils` due to
vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to
vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has
no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request Jan 15, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), structured-output, v1

Projects: Status: Done

2 participants