[Bugfix] Fix DP Attention Padding in Dummy Run#34187
Merged
robertgshaw2-redhat merged 2 commits intovllm-project:mainfrom Feb 10, 2026
Conversation
…unpadded Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
robertgshaw2-redhat
approved these changes
Feb 9, 2026
Contributor
Code Review
This pull request addresses a crash during distributed dummy runs by correctly handling attention padding. The change in `_dummy_run` to pass `num_tokens_padded` to `_build_attention_metadata` when `pad_attn` is true is a direct and effective fix for the described issue. This change also aligns the logic with the `execute_model` function, improving code consistency. The fix is well-contained and correct. I approve of this change.
Merged commit 81e217f into vllm-project:main — 50 of 51 checks passed
llsj14 pushed a commit to llsj14/vllm that referenced this pull request on Mar 1, 2026.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Mar 4, 2026.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Mirror of #34009 by @benchislett — "Maintainers are allowed to edit this pull request." was not enabled on the original PR, so pushing review fixes was not possible.
All credit to @benchislett for the original fix.
Purpose
FIX #32626
FIX #33450
Problem: TRTLLM attention requires that `num_decode_tokens` be divisible by `num_requests`. However, during DP we sometimes do a dummy run on one of the workers so they don't get out of sync: in such cases, we pad for attention, but the attention metadata builder receives the padded number of requests and the unpadded number of tokens.
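The mismatch above can be illustrated with a minimal sketch. The function name and check are hypothetical (illustrative, not vLLM's actual code); they just encode the divisibility invariant that a decode-only batch must satisfy:

```python
# Hypothetical sketch of the invariant enforced by TRTLLM-style attention
# backends: in a decode-only batch every request contributes exactly one
# token, so the token count must be divisible by the request count.
def check_decode_batch(num_decode_tokens: int, num_requests: int) -> bool:
    """Return True when the decode batch shape is valid for the backend."""
    return num_decode_tokens % num_requests == 0

# Normal decode step: 2 requests, 2 decode tokens -> valid.
assert check_decode_batch(2, 2)

# Buggy padded dummy run: requests padded to 2 but tokens left unpadded
# at 1 -> the invariant fails, matching the crash described below.
assert not check_decode_batch(1, 2)
```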
This leads to a crash where we see `num_decode_tokens == 1` but `num_requests == 2`. I tracked this down to `_build_attention_metadata` only seeing `num_tokens=num_tokens_unpadded` and having `num_tokens_padded=None` (omitted, the default is None) even when `pad_attn` is True.

I'm not sure if this behaviour is intentional, or what the desired strategy for full-graph attention padding for DP will be. If `num_reqs > num_decode_tokens` is supposed to be supported, more investigation will be required to ensure that all backends (such as TRTLLM / FlashInfer) can handle this case. It might be as easy as broadening the check in the assert, or we may need to do some additional slicing and/or padding for some backends. I'm not too sure.

Testing
Ran this to launch the server on 4xB200:
and this to reproduce the crash:
It doesn't trigger every single time, but according to my debugger it crashes consistently whenever the DP worker does a padded dummy run for a decode-only iteration (outside of startup).
With this patch, the run finishes and the LM-Eval GSM8K score is unchanged (fewshot=5, non-chat-completions, ~80%). But I am not sure whether this is safe more broadly or whether the padded count was omitted here on purpose.