[Bugfix] Fix DP Attention Padding in Dummy Run#34187
Merged
robertgshaw2-redhat merged 2 commits intovllm-project:mainfrom Feb 10, 2026
Conversation
…unpadded Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
robertgshaw2-redhat
approved these changes
Feb 9, 2026
Contributor
Code Review
This pull request addresses a crash during distributed dummy runs by correctly handling attention padding. The change in `_dummy_run` to pass `num_tokens_padded` to `_build_attention_metadata` when `pad_attn` is true is a direct and effective fix for the described issue. This change also aligns the logic with the `execute_model` function, improving code consistency. The fix is well-contained and correct. I approve of this change.
Merged commit 81e217f into vllm-project:main — 50 of 51 checks passed
llsj14 pushed a commit to llsj14/vllm that referenced this pull request on Mar 1, 2026.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Mar 4, 2026.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Mirror of #34009 by @benchislett — "Maintainers are allowed to edit this pull request." was not enabled on the original PR, so pushing review fixes was not possible.
All credit to @benchislett for the original fix.
Purpose
FIX #32626
FIX #33450
Problem: TRTLLM attention requires that `num_decode_tokens` be divisible by `num_requests`. However, during DP we sometimes do a dummy run on one of the workers so they don't get out of sync: in such cases, we pad for attention, but the attention metadata builder receives the padded number of requests and the unpadded number of tokens.
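The mismatch above can be illustrated with a minimal sketch. The function name and check are hypothetical (illustrative, not vLLM's actual code); they just encode the divisibility invariant that a decode-only batch must satisfy:

```python
# Hypothetical sketch of the invariant enforced by TRTLLM-style attention
# backends: in a decode-only batch every request contributes exactly one
# token, so the token count must be divisible by the request count.
def check_decode_batch(num_decode_tokens: int, num_requests: int) -> bool:
    """Return True when the decode batch shape is valid for the backend."""
    return num_decode_tokens % num_requests == 0

# Normal decode step: 2 requests, 2 decode tokens -> valid.
assert check_decode_batch(2, 2)

# Buggy padded dummy run: requests padded to 2 but tokens left unpadded
# at 1 -> the invariant fails, matching the crash described below.
assert not check_decode_batch(1, 2)
```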
This leads to a crash where we see `num_decode_tokens == 1` but `num_requests == 2`. I tracked this down to `_build_attention_metadata` only seeing `num_tokens=num_tokens_unpadded` and having `num_tokens_padded=None` (omitted, the default is None) even when `pad_attn` is True.

I'm not sure if this behaviour is intentional, or what the desired strategy for full-graph attention padding for DP will be. If `num_reqs > num_decode_tokens` is supposed to be supported, more investigation will be required to ensure that all backends (such as TRTLLM / FlashInfer) can handle this case. It might be as easy as broadening the check in the assert, or we may need to do some additional slicing and/or padding for some backends. I'm not too sure.

Testing
Ran this to launch the server on 4xB200:
and this to reproduce the crash:
It doesn't trigger every single time, but according to my debugger it crashes consistently whenever the DP worker does a padded dummy run for a decode-only iteration (outside of startup).
With this patch, the run finishes and the LM-Eval GSM8K score is unchanged (fewshot=5, non-chat-completions, ~80%). But I am not sure whether this is safe more broadly or whether the padded count was omitted here on purpose.