
[Bugfix] Fix DP Attention Padding in Dummy Run #34187

Merged
robertgshaw2-redhat merged 2 commits into vllm-project:main from
LucasWilkinson:fix-dp-attn-padding-mirror on Feb 10, 2026

Conversation

@LucasWilkinson (Collaborator) commented Feb 9, 2026

Mirror of #34009 by @benchislett — "Maintainers are allowed to edit this pull request." was not enabled on the original PR, so pushing review fixes was not possible.

All credit to @benchislett for the original fix.

Purpose

FIX #32626
FIX #33450

Problem: TRTLLM attention requires that num_decode_tokens be divisible by num_requests. However, during DP we sometimes do a dummy run on one of the workers so they don't get out of sync: in such cases, we pad for attention, but the attention metadata builder receives the padded number of requests and the unpadded number of tokens.

This leads to a crash where we see num_decode_tokens==1 but num_requests==2. I tracked this down to _build_attention_metadata only seeing num_tokens=num_tokens_unpadded and having num_tokens_padded=None (omitted, default is None) even when pad_attn is True.
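To make the failure mode concrete, here is a minimal sketch (not vLLM's actual code; `check_decode_batch` is a hypothetical stand-in) of the divisibility invariant the TRTLLM backend enforces, and how a padded request count paired with an unpadded token count violates it:

```python
# Hypothetical sketch of the TRTLLM-style decode-batch invariant described
# above: decode tokens must divide evenly across the (possibly padded)
# request count. Not vLLM's actual code.
def check_decode_batch(num_decode_tokens: int, num_requests: int) -> bool:
    return num_requests > 0 and num_decode_tokens % num_requests == 0

# The crash scenario: padded num_requests=2 with unpadded num_decode_tokens=1.
assert not check_decode_batch(num_decode_tokens=1, num_requests=2)
# Padding the token count as well restores the invariant.
assert check_decode_batch(num_decode_tokens=2, num_requests=2)
```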

I'm not sure if this behaviour is intentional, or what the desired strategy for full-graph attention padding under DP will be. If num_reqs > num_decode_tokens is supposed to be supported, more investigation will be required to ensure that all backends (such as TRTLLM / FlashInfer) can handle this case. It might be as easy as broadening the check in the assert, or some backends may need additional slicing and/or padding.
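The shape of the fix can be sketched as follows. The names mirror the PR text (`_build_attention_metadata`, `pad_attn`, `num_tokens_padded`), but the wrapper function and signatures here are illustrative assumptions, not vLLM's actual API:

```python
# Illustrative sketch of the fix: forward the padded token count to the
# attention metadata builder whenever pad_attn is set, instead of letting
# num_tokens_padded silently default to None. Names are assumptions
# modeled on the PR description, not vLLM's real signatures.
def build_dummy_run_metadata(build_fn, num_tokens_unpadded: int,
                             num_tokens_padded: int, pad_attn: bool):
    # Before the fix: num_tokens_padded was omitted (defaulting to None)
    # even when pad_attn was True, so the builder saw a padded request
    # count alongside an unpadded token count.
    return build_fn(
        num_tokens=num_tokens_unpadded,
        num_tokens_padded=num_tokens_padded if pad_attn else None,
    )

# A stub builder that just echoes what it received:
meta = build_dummy_run_metadata(
    lambda num_tokens, num_tokens_padded: (num_tokens, num_tokens_padded),
    num_tokens_unpadded=1, num_tokens_padded=2, pad_attn=True)
assert meta == (1, 2)  # builder now sees the padded token count
```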

Testing

Ran this to launch the server on 4xB200:

```shell
vllm serve microsoft/Phi-mini-MoE-instruct -dp 2 -tp 2 --enable-expert-parallel \
  --attention-backend flashinfer --attention-config.use_trtllm_attention true
```

and this to reproduce the crash:

```shell
vllm bench serve --random-input-len 128 --random-output-len 256 --random-range-ratio 0.5 \
  --max-concurrency 64 --request-rate 5 --num-prompts 128 --num-warmups 0 \
  --model $MODEL --base-url $BASE_URL
```

It doesn't trigger every single time, but according to my debugger it crashes consistently whenever the DP worker does a padded dummy run for a decode-only iteration (outside of startup).

With this patch, it finishes and the LM_Eval GSM8k score is unchanged (fewshot=5, non-chat-completions, ~80%). But I am not sure if this is safe more broadly or if it is omitted here on purpose.

benchislett and others added 2 commits February 9, 2026 18:39
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…unpadded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@mergify mergify bot added v1 bug Something isn't working labels Feb 9, 2026
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) February 9, 2026 23:45
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 9, 2026
@gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request addresses a crash during distributed dummy runs by correctly handling attention padding. The change in _dummy_run to pass num_tokens_padded to _build_attention_metadata when pad_attn is true is a direct and effective fix for the described issue. This change also aligns the logic with the execute_model function, improving code consistency. The fix is well-contained and correct. I approve of this change.

@robertgshaw2-redhat robertgshaw2-redhat merged commit 81e217f into vllm-project:main Feb 10, 2026
50 of 51 checks passed
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed v1


Development

Successfully merging this pull request may close these issues.

[Bug]: Attention Assertion
[Bug]: TRTLLM Attention Failure with DP/EP

3 participants