[Attention] Support distinguishing between short extends and decodes#37303
Conversation
Code Review
The pull request refines the batch reordering logic to classify requests into four types (decode, short_extend, long_extend, and prefill) based on num_scheduled_tokens, num_computed_tokens, and a newly introduced num_prompt_tokens. The MockInputBatch and GPUInputBatch classes gain a num_prompt_tokens field, CommonAttentionMetadata gains an is_prefilling flag, and split_decodes_and_prefills and reorder_batch_to_split_decodes_and_prefills are updated to use the finer-grained classification. The mamba_attn backend opts into the new behavior by explicitly setting treat_short_extends_as_decodes=False, and legacy "prefill as decode" handling is removed from gpu_model_runner.py and mamba_utils.py. The review comments stress that all usages of the MockInputBatch constructor must be updated for its changed signature, and suggest documenting the REORDER_TEST_CASES entry format for readability.
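To make the four-way split concrete, here is a hedged sketch of the classification the summary describes. The function name, argument names, and the `decode_threshold` parameter are illustrative, not the PR's actual API; the rules are inferred from the summary and the comments below (short requests split on whether the prompt is fully computed, longer requests split on whether any context exists yet).

```python
def classify_request(num_scheduled: int, num_computed: int,
                     num_prompt: int, decode_threshold: int = 1) -> str:
    """Hypothetical four-way request classification (not the PR's exact code)."""
    # Still prefilling while some prompt tokens remain uncomputed.
    is_prefilling = num_computed < num_prompt
    if num_scheduled <= decode_threshold:
        # Short runs: a true decode unless the prompt is still being filled,
        # in which case it is a short extend.
        return "short_extend" if is_prefilling else "decode"
    # Longer runs: "pure" prefills have no context yet; partially computed
    # prompts are long extends.
    return "prefill" if num_computed == 0 else "long_extend"
```

A backend that sets treat_short_extends_as_decodes=True would simply fold the "short_extend" bucket into "decode".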
```diff
  class MockInputBatch:
-     def __init__(self, req_ids, num_computed_tokens_cpu):
+     def __init__(self, req_ids, num_computed_tokens_cpu, num_prompt_tokens):
```
The MockInputBatch constructor now requires num_prompt_tokens. This is a critical change: every instantiation of this mock class in the tests must be updated to pass the new parameter, or they will fail.
```python
    def __init__(self, req_ids, num_computed_tokens_cpu, num_prompt_tokens):
        self.req_ids = req_ids
        self.num_computed_tokens_cpu = num_computed_tokens_cpu
        self.num_prompt_tokens = num_prompt_tokens
```
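Per the review note, every call site needs the new argument. A self-contained sketch of the updated mock and a hypothetical instantiation (the values are illustrative, not taken from the test suite):

```python
class MockInputBatch:
    def __init__(self, req_ids, num_computed_tokens_cpu, num_prompt_tokens):
        self.req_ids = req_ids
        self.num_computed_tokens_cpu = num_computed_tokens_cpu
        self.num_prompt_tokens = num_prompt_tokens

# Call sites must now supply per-request prompt lengths as well:
batch = MockInputBatch(
    req_ids=["r0", "r1"],
    num_computed_tokens_cpu=[16, 0],   # r0 has context, r1 has none
    num_prompt_tokens=[16, 32],        # r0 is done prefilling, r1 is not
)
```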
```python
# Test cases for batch reordering
# Format: (num_scheduled, num_computed, num_prompt)
```
Adding a comment to describe the format of the REORDER_TEST_CASES dictionary entries improves readability and maintainability. It's important to clarify what each value in the tuple represents.
```python
# Format: (num_scheduled, num_computed, num_prompt)
REORDER_TEST_CASES = {
```
does this pass the test added in #35447?
How does this handle a "short prefill <= threshold, no context" request? It's not an extend, but it is below the threshold. Does it get classified as prefill or decode?
"pure prefills" i.e. no-context are always placed at the back, this is for the AMD attention backend
the first 2 pass the second 2 OOM (I assume because im on H100s); will run it in the CI |
Ah, I might need to run that test as TP4 instead of TP2 :(
benchislett
left a comment
LGTM, thanks for the cleanup.
I would like to see that specific test case passing before we merge this, to ensure that the nemotron-h-mtp-chunkedprefill case is covered.
```python
slot_mapping_attn = slot_mappings[attn_gid]
self.slot_mapping = slot_mapping_attn[:num_tokens].cpu().numpy()
# Compute is_prefilling: True if request is still in prefill phase
# (num_computed_tokens < num_prompt_tokens). Used by mamba backends to
```
nit: non-mamba-specific logic is more general and consistent with other comments
```diff
- # (num_computed_tokens < num_prompt_tokens). Used by mamba backends to
+ # (num_computed_tokens < num_prompt_tokens). Used by some backends to
```
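The flag discussed in this comment is a per-request elementwise comparison. A minimal plain-Python sketch (the list names are assumed for illustration, not the runner's actual fields):

```python
# Per-request token counts (illustrative values).
num_computed_tokens = [16, 32, 8]
num_prompt_tokens = [16, 64, 8]

# A request is still prefilling while it has uncomputed prompt tokens.
is_prefilling = [c < p for c, p in zip(num_computed_tokens, num_prompt_tokens)]
```

In the real runner this would be a vectorized comparison over numpy arrays rather than a Python loop.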
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Force-pushed from 8ab178f to f31161d
…llm-project#37303)
Alternative to #35447, support distinguishing between short-extends/prefills and decodes via batch reordering; the batch order is now:
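Given the four categories described in the review summary, the reordering itself can be sketched as a stable sort of request indices by category rank. The ranks below are assumed from the category ordering in the summary (decodes first, pure prefills last), not copied from the PR's implementation:

```python
# Hypothetical category ranks: decodes first, pure prefills last.
CATEGORY_RANK = {"decode": 0, "short_extend": 1, "long_extend": 2, "prefill": 3}

def reorder_indices(categories):
    """Return request indices in batch order, stably sorted by category rank."""
    return sorted(range(len(categories)),
                  key=lambda i: CATEGORY_RANK[categories[i]])
```

A stable sort preserves the original relative order within each category, which keeps the permutation (and the slot-mapping bookkeeping that depends on it) minimal.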