
[Attention] Make split_decodes_and_prefills(..., require_uniform=True) support padding#29644

Merged
LucasWilkinson merged 5 commits into vllm-project:main from
neuralmagic:lwilkinson/fix-requires-uniform-case
Dec 9, 2025

Conversation

@LucasWilkinson
Collaborator

@LucasWilkinson LucasWilkinson commented Nov 28, 2025

With #28579 we pad attention metadata before building; in the case of a uniform decode batch we want to make sure that num_decodes matches the cudagraph size, so that attention backends receive the same batch size the graph was captured with. Currently we work around this in all the existing attention backends (e.g. vllm-project/FlashMLA#3), since we used to pad for attention after building the attention metadata, so this was always the case. But this change is needed for #27532, since the FlashMLA FP8 sparse kernels do not have this workaround yet.

The | (query_lens == 0) is removed from:

        is_prefill = (query_lens > decode_threshold) | (query_lens == 0)

as a small cleanup, since it is not actually required: we do want to treat zero-length (padded) entries as decodes in the full-decode case, and if there is a prefill the split happens at the first prefill, so these trailing entries are ignored anyway.
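To illustrate the intent (this is a minimal NumPy sketch, not the actual vLLM implementation; the function name and signature mirror `split_decodes_and_prefills` but the body is simplified): under `require_uniform=True`, padded entries with `query_len == 0` are tolerated so a uniform decode batch padded up to the cudagraph size still reports `num_decodes` equal to the padded batch size.

```python
import numpy as np

def split_decodes_and_prefills(query_lens, decode_threshold=1, require_uniform=False):
    """Sketch: split a batch into leading decodes and trailing prefills.

    With require_uniform=True, padded tail entries (query_len == 0) are
    counted as decodes so num_decodes matches the padded cudagraph size.
    """
    query_lens = np.asarray(query_lens)
    if require_uniform:
        # A request is a prefill if its length differs from the first
        # request's length; zero-length padding is tolerated so a fully
        # padded uniform-decode batch stays all-decode.
        is_prefill = (query_lens != query_lens[0]) & (query_lens != 0)
    else:
        is_prefill = query_lens > decode_threshold
    # Split at the first prefill; anything after it (including padding)
    # falls into the prefill region and is handled there.
    first_prefill = int(np.argmax(is_prefill)) if is_prefill.any() else len(query_lens)
    return first_prefill, len(query_lens) - first_prefill

# A uniform decode batch of 3 padded to 4 requests is still all-decode:
# split_decodes_and_prefills([1, 1, 1, 0], require_uniform=True) -> (4, 0)
```

Note that in this sketch the `| (query_lens == 0)` term is unnecessary: zero-length entries either count as decodes (full-decode batch) or land past the first prefill.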

Test Plan:

CI

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@mergify mergify bot added the v1 label Nov 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to add support for padded requests in split_decodes_and_prefills when require_uniform=True, which is important for cudagraph compatibility. The changes involve updating the splitting logic and adding a corresponding test case. While the intention is correct, I've found a critical issue in the implementation for detecting padded uniform batches. The current logic can fail if the first request is a padding request, leading to incorrect batch splitting. I've provided a detailed comment with a suggested fix to make the logic more robust. The rest of the changes look good.

@LucasWilkinson LucasWilkinson added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 28, 2025
Collaborator

@benchislett benchislett left a comment


LGTM

@LucasWilkinson LucasWilkinson enabled auto-merge (squash) December 9, 2025 05:06
@LucasWilkinson LucasWilkinson merged commit aed8469 into vllm-project:main Dec 9, 2025
50 checks passed
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…e)` support padding (vllm-project#29644)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
