[Perf] Async Scheduling + Speculative Decoding + Structured Outputs #29821
Merged
njhill merged 17 commits into vllm-project:main on Jan 6, 2026
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
njhill (Member) approved these changes on Jan 6, 2026 and left a comment:
Thanks @benchislett for the teamwork!
yugong333 pushed a commit to yugong333/vllm that referenced this pull request on Jan 9, 2026:
…llm-project#29821) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request on Jan 13, 2026:

### What this PR does / why we need it?
Upgrade vllm commit to 0109 (bde38c11df0ea066a740efe9b77fff5418be45df)
1. remove `init_cached_hf_modules` due to vllm-project/vllm#31786
2. fix spec_decode e2e test due to vllm-project/vllm#29821 breakage
3. fix `vllm.v1.attention.backends.utils` due to vllm-project/vllm#31891
4. fix `self.seq_lens - query_lens` on same device due to vllm-project/vllm#31773
5. skip model_runner_v2 e2e test due to `'_OpNamespace' '_C' object has no attribute 'get_cuda_view_from_cpu_tensor'`

- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
aipaes pushed a commit to aipaes/vllm-ascend that referenced this pull request on Jan 15, 2026 (same vllm-ascend upgrade commit as above).
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request on Jan 16, 2026 (same commit as above).
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request on Jan 21, 2026 (same commit, additionally signed off by dsuhinin <suhinin.dmitriy@gmail.com>).
starmountain1997 pushed a commit (twice) to starmountain1997/vllm-ascend that referenced this pull request on Jan 31, 2026 (same vllm-ascend upgrade commit).
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request on Feb 19, 2026 (same commit).
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request on Feb 28, 2026 (same vllm-ascend upgrade commit, additionally signed off by zrj026 <zhangrunjiang026@gmail.com>).
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request on Mar 2, 2026 (same vllm-ascend upgrade commit).
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request on Mar 4, 2026 (same vllm-ascend upgrade commit).
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request on Mar 7, 2026 (same vllm-ascend upgrade commit).
Purpose
This PR enables structured outputs when using speculative decoding and async scheduling. The solution implemented in this PR is as follows:
execute_model, the scheduler will calltake_draft_token_idswhich will wait for the draft tokens to arrive on the cpu and yield them.In order to accommodate for invalid draft tokens (which do not adhere to the schema), we replace them on the scheduler_output object and pad the rejected tokens with
-1, which will be ignored when filling the bitmasks. To ensure that the model does not sample from a position after an invalid token, we attach a per-request count of invalid tokens onto the grammar output to mask out the sampled positions.Test Plan
- GSM8K to check for base-model correctness regressions.
- Structured outputs: `xgrammar_bench` for speed, and also `json_unique` for grammar adherence.
Test Result
Passes correctness and structured output adherence tests.
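As a rough illustration of the scheme described in the Purpose section, the sketch below pads rejected draft tokens with -1, skips those positions when filling grammar bitmasks, and tracks a per-request invalid-token count for the sampler to mask out. All names here (`pad_invalid_drafts`, `bitmask_positions`, `PLACEHOLDER`) are hypothetical, not vLLM's actual API.

```python
# Hypothetical sketch of invalid-draft-token handling; names are
# illustrative and do not correspond to vLLM's real internals.

PLACEHOLDER = -1  # pad value for draft tokens rejected by the grammar

def pad_invalid_drafts(draft_tokens, is_valid):
    """Replace schema-violating draft tokens with PLACEHOLDER.

    Everything after the first invalid token is also padded, since no
    token past a rejected position can be accepted. Returns the padded
    list and the per-request count of invalid tokens, which the sampler
    can use to mask out the trailing positions.
    """
    padded = []
    seen_invalid = False
    for tok, ok in zip(draft_tokens, is_valid):
        seen_invalid = seen_invalid or not ok
        padded.append(PLACEHOLDER if seen_invalid else tok)
    num_invalid = sum(1 for t in padded if t == PLACEHOLDER)
    return padded, num_invalid

def bitmask_positions(padded_tokens):
    """Positions for which a grammar bitmask is filled; PLACEHOLDER
    positions are ignored."""
    return [i for i, t in enumerate(padded_tokens) if t != PLACEHOLDER]

# Drafts [5, 9, 3] where token 9 violates the schema:
padded, num_invalid = pad_invalid_drafts([5, 9, 3], [True, False, True])
print(padded, num_invalid)        # [5, -1, -1] 2
print(bitmask_positions(padded))  # [0]
```

In this example only position 0 gets a bitmask, and the sampler would mask out the last two sampled positions using the invalid-token count of 2.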
Benchmarking details:
Setup:
Three runs: one with async + spec, one with only spec, and one with only async.
Result:
100% coverage of xgrammar_bench in all cases.
Perf measurement (when controlling for output tokens with `ignore-eos`):
With Structured Outputs
No Structured Outputs