[Bugfix] Fix DP MTP Dummy Run (#35243)
Merged: benchislett merged 4 commits into vllm-project:main on Mar 17, 2026
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Contributor
Code Review
This pull request addresses a bug that occurs during a Data Parallelism (DP) dummy run when Multi-Token Prediction (MTP) is enabled. The issue stemmed from an incorrect number of tokens being used for the dummy batch, leading to an assertion failure related to token padding. The fix correctly uses 1 + num_speculative_tokens for the dummy run, aligning with the expected padding and resolving the crash. The change is accurate and improves code clarity by replacing a magic number with a descriptive variable.
LucasWilkinson approved these changes on Feb 26, 2026
Collaborator LucasWilkinson left a comment: Makes sense to me! Thanks for the fix!
Member: @benchislett could you rebase? I'm not able to update the branch.
Collaborator: @njhill just a friendly reminder about this PR.
TomerBN-Nvidia added a commit to TomerBN-Nvidia/vllm that referenced this pull request on Mar 17, 2026
execute_dummy_batch() hardcoded _dummy_run(1, uniform_decode=True), but with MTP speculative decoding the uniform decode query length is 1 + num_speculative_tokens. This caused the 'tokens not padded correctly' assertion in split_decodes_and_prefills to fire under multi-node DP load. Use uniform_decode_query_len instead of the hardcoded 1. Upstream: vllm-project#35243 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
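The change this commit message describes can be sketched as follows. The class shape and the `_dummy_run` signature here are assumptions for illustration; only the attribute `uniform_decode_query_len` and the `1 + num_speculative_tokens` relationship come from the message itself.

```python
# Illustrative sketch of the fix, not vLLM's actual implementation.
class DummyRunner:
    def __init__(self, num_speculative_tokens=0):
        # Uniform decode length: 1 verified token plus the MTP draft tokens.
        self.uniform_decode_query_len = 1 + num_speculative_tokens

    def _dummy_run(self, num_tokens, uniform_decode=True):
        # Stand-in for the real dummy forward pass; just report the batch size.
        return num_tokens

    def execute_dummy_batch(self):
        # Before the fix this was hardcoded: self._dummy_run(1, uniform_decode=True)
        # After the fix, the batch is sized by the uniform decode query length.
        return self._dummy_run(self.uniform_decode_query_len, uniform_decode=True)

print(DummyRunner(num_speculative_tokens=3).execute_dummy_batch())  # 4
```

With no speculative tokens the behavior is unchanged (a 1-token dummy run), so the fix only affects MTP configurations.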
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request on Mar 17, 2026
andylolu2 pushed a commit to andylolu2/vllm that referenced this pull request on Mar 18, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request on Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request on Mar 19, 2026
Purpose
FIX #33899
The DP dummy run issues a uniform decode batch with num_reqs=1, num_tokens=1, and uniform_decode=True. When MTP is enabled, this pads the number of tokens up to max_query_len == (1 + num_speculative_tokens).
But when preparing num_scheduled_tokens (and then building query_start_loc), we do this:
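The original snippet was not captured in this excerpt; the following is a simplified sketch of the bookkeeping it describes, with identifiers assumed rather than copied from vLLM.

```python
# Simplified sketch of the dummy-run token bookkeeping (assumed names,
# not vLLM's exact code).
def build_num_scheduled_tokens(num_reqs, num_tokens, max_query_len):
    # Every request is provisionally assigned max_query_len tokens...
    num_scheduled_tokens = [max_query_len] * num_reqs
    # ...then the last entry is clamped so the total equals num_tokens.
    num_scheduled_tokens[-1] = num_tokens - (num_reqs - 1) * max_query_len
    return num_scheduled_tokens

# MTP with 2 speculative tokens pads to max_query_len = 3, but the dummy
# run requested num_tokens = 1:
print(build_num_scheduled_tokens(num_reqs=1, num_tokens=1, max_query_len=3))
# [1]
```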
So the last value gets overridden to 1. Then this sanity check assert fails in split_decodes_and_prefills:
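The assert itself was also lost in extraction; a simplified stand-in for the uniform-decode sanity check, with the shape assumed rather than taken from vLLM:

```python
# Simplified stand-in for the sanity check in split_decodes_and_prefills
# (shape assumed, not vLLM's exact code).
def check_uniform_decode(query_lens, uniform_decode_query_len):
    # In a uniform decode batch, every query must have the padded length.
    assert all(q == uniform_decode_query_len for q in query_lens), \
        "tokens not padded correctly"

# The padded dummy batch claims 3 tokens per request, but the truncated
# schedule says the last (only) request has 1:
try:
    check_uniform_decode([1], uniform_decode_query_len=3)
except AssertionError as e:
    print(e)  # tokens not padded correctly
```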
I'm not sure if it would be correct to instead avoid truncating num_scheduled_tokens, or try to handle the padding differently in the decode split, but an easy workaround is just to run the dummy batch with (1+num_speculative_tokens) in the first place. That is what this PR does.
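The effect of the workaround can be shown in miniature (variable names assumed, not vLLM's exact code): requesting the full uniform decode length up front makes the truncated schedule already uniform, so no mismatch arises.

```python
# Sketch of the workaround: size the dummy batch by (1 + num_speculative_tokens)
# so padding is a no-op (assumed names, not vLLM's exact code).
NUM_SPECULATIVE_TOKENS = 2
max_query_len = 1 + NUM_SPECULATIVE_TOKENS  # 3 with MTP enabled

num_reqs = 1
num_tokens = 1 + NUM_SPECULATIVE_TOKENS  # the fix: previously hardcoded to 1
num_scheduled_tokens = [max_query_len] * num_reqs
num_scheduled_tokens[-1] = num_tokens - (num_reqs - 1) * max_query_len

# The schedule is now uniform, so the decode-split sanity check holds.
assert all(q == max_query_len for q in num_scheduled_tokens)
print(num_scheduled_tokens)  # [3]
```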
Test Plan
GSM8k:
Test Result