[Continuous Batching] Unified prefilling & decoding prototype #172
Merged: ericcurtin merged 13 commits into vllm-project:main, Mar 20, 2026
Conversation
Resolve conflict in model_runner.py: keep STTExecutor class (from vllm-project#173), drop stale MAX_PACKED_PREFILL_TOKENS constant (removed in this branch). Signed-off-by: ran <hzz5361@psu.edu>
ericcurtin approved these changes, Mar 20, 2026
Summary
- Did not touch the varlen kernel; this PR just wires it up to unified prefilling & decoding.
- One `model()` call per scheduler step, via `prepare_unified()` plus the varlen paged attention kernel.
- Decode resumes from the `start_pos` offset instead of re-computing from position 0.
- Post-processing handles 4 arms after the unified forward pass:
- Rows `[0..num_decode-1]`: append the sampled token to the existing state and increment `generated_tokens`.
- Newly prefilled requests: create a `RequestState` with `generated_tokens=1`.
- Update the `RequestState` and transition the request from the prefill phase to the decode phase.
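The post-processing arms described above can be sketched roughly as follows. Apart from `RequestState`, `generated_tokens`, and the `[0..num_decode-1]` row layout, which the PR description names, every identifier here is hypothetical; this is an illustration of the dispatch idea, not the PR's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class RequestState:
    # Minimal stand-in for the runner's per-request state (fields assumed).
    tokens: list[int] = field(default_factory=list)
    generated_tokens: int = 0
    phase: str = "prefill"

def postprocess(sampled: list[int], num_decode: int,
                decode_states: list[RequestState]) -> list[RequestState]:
    """Dispatch each sampled token after the unified forward pass.

    Rows [0..num_decode-1] belong to running decode requests; the
    remaining rows are requests that just finished their prefill.
    """
    new_states = []
    for i, tok in enumerate(sampled):
        if i < num_decode:
            # Decode arm: append to the existing state, bump the counter.
            st = decode_states[i]
            st.tokens.append(tok)
            st.generated_tokens += 1
        else:
            # Prefill-complete arm: fresh RequestState with generated_tokens=1,
            # then transition the request from prefill to decode.
            st = RequestState(tokens=[tok], generated_tokens=1, phase="decode")
            new_states.append(st)
    return new_states
```

In this sketch the scheduler only has to keep decode rows at the front of the batch; everything after `num_decode` is, by construction, a request leaving the prefill phase.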
Deprecation of the vLLM v0-style model runner:

- `_batched_decode_paged()` in `model_runner.py`
- `_prefill_packed_paged()` in `model_runner.py`
- `_run_packed_prefill()` in `model_runner.py`
- `prepare_decode()` in `paged_attention_common.py`, folded into `prepare_unified()`
- `prepare_prefill_packed()` in `paged_attention_common.py`, folded into `prepare_unified()`
- `_metal_kernel_decode_attention()` in `paged_attention.py`; `_metal_kernel_prefill_attention()` is now used for everything

All six were v0-style "phase-separated" functions. They are replaced by two v1-style unified functions: `prepare_unified()` and `_unified_prefill_decode_paged()`.
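For intuition, a `prepare_unified()`-style batch builder concatenates one token per running decode request with the full prompts of new requests into a single varlen batch, carrying per-row position offsets so decodes resume at `start_pos` instead of position 0. The sketch below is an assumption-heavy illustration: the `Request` class and the return shape are invented, and only the name `prepare_unified` and the `start_pos` idea come from the PR:

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                        # prompt token ids
    generated: list[int] = field(default_factory=list)

    @property
    def start_pos(self) -> int:
        # Next position in the sequence; lets decode skip recomputation.
        return len(self.prompt) + len(self.generated)

def prepare_unified(decodes: list[Request], prefills: list[Request]):
    """Build one flat (tokens, positions, seq_lens) varlen batch mixing
    decode rows (one token each, at start_pos - 1) and prefill rows
    (whole prompt, positions 0..len-1). Decode rows come first, matching
    the [0..num_decode-1] layout the post-processing relies on."""
    tokens, positions, seq_lens = [], [], []
    for r in decodes:
        tokens.append(r.generated[-1])       # last generated token is the input
        positions.append(r.start_pos - 1)
        seq_lens.append(1)
    for r in prefills:
        tokens.extend(r.prompt)
        positions.extend(range(len(r.prompt)))
        seq_lens.append(len(r.prompt))
    return tokens, positions, seq_lens
```

A varlen attention kernel can then consume the flat token list plus `seq_lens` (or a cumulative-lengths array derived from it) in a single launch, which is what makes the one-`model()`-call-per-step design possible.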
Benchmark

I used the sonnet dataset this time, with 1024 input tokens and 128 output tokens. The long inputs make the O(n²) problem of the "before this PR" path stand out.
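As a back-of-the-envelope illustration (my arithmetic, not the PR's measured numbers) of why this setting exaggerates the gap: if every decode step recomputes from position 0, per-request work grows quadratically with sequence length, while the `start_pos`-offset path stays linear:

```python
def tokens_recompute(prompt_len: int, out_len: int) -> int:
    # Assumed v0-style behavior: each decode step re-runs the model over
    # the whole prefix, so step i processes prompt_len + i tokens.
    return sum(prompt_len + i for i in range(out_len))

def tokens_offset(prompt_len: int, out_len: int) -> int:
    # Unified path: prefill the prompt once, then one new token per step.
    return prompt_len + out_len

print(tokens_recompute(1024, 128))  # 139200 tokens touched per request
print(tokens_offset(1024, 128))     # 1152 tokens touched per request
```

At 1024-in/128-out that is roughly a 120x difference in tokens touched per request, so the benchmark setting is well chosen to expose the quadratic path.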
Results

Full benchmark output:

This PR (paged KV, continuous batching, vLLM v1 style):

Before this PR (paged KV, vLLM v0 style):

mlx-lm path (non-paged KV):