[Core] Concurrent partial prefills for V1 #31330
ppppqp wants to merge 16 commits into vllm-project:main
Conversation
Code Review
This pull request implements concurrent partial prefills for V1, which is a significant feature for improving throughput and latency. The changes introduce new configuration options and complex scheduling logic. The implementation looks mostly solid, but I've found a critical typo in the configuration validation that would lead to a runtime error.
```python
class PrefillState:
    """Lightweight state used to reason about a request's prefill status."""

    # whether the request is in the prefill phase
    is_prefill: bool
    # number of remaining tokens to prefill
    remaining_tokens: int
    # whether the prefill is considered a long prefill
    is_long_prefill: bool
```
Adding a new dataclass here as an abstraction, in case in the future we want to implement a more sophisticated concurrency strategy (e.g. the strategy for determining what counts as a long prefill).
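To illustrate what such an abstraction could buy us, here is a purely hypothetical sketch: `LongPrefillPolicy`, `threshold_policy`, and `make_prefill_state` are names I made up for illustration and are not part of this PR.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical: a pluggable predicate that decides whether a prefill
# counts as "long", given its remaining token count.
LongPrefillPolicy = Callable[[int], bool]


def threshold_policy(long_prefill_token_threshold: int) -> LongPrefillPolicy:
    """Default policy: a prefill is long when its remaining token count
    exceeds the configured threshold."""
    return lambda remaining: remaining > long_prefill_token_threshold


@dataclass
class PrefillState:
    """Lightweight state used to reason about a request's prefill status."""

    is_prefill: bool
    remaining_tokens: int
    is_long_prefill: bool


def make_prefill_state(
    is_prefill: bool, remaining_tokens: int, policy: LongPrefillPolicy
) -> PrefillState:
    # The policy decides long vs. short in one place, so a smarter
    # strategy could be swapped in without touching the scheduler.
    return PrefillState(is_prefill, remaining_tokens, policy(remaining_tokens))
```

A different policy (e.g. one based on queue pressure) would only need to satisfy the same callable signature.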
Hi @ppppqp, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```python
        request, request.num_computed_tokens
    )
    if prefill_state.is_prefill:
        num_new_tokens = min(num_new_tokens, partial_prefill_slot_budget)
```

Cap the computed `num_new_tokens` at the allocated budget. This ensures that every request granted a partial prefill slot can make progress.
```python
def _is_prefill_with_tokens(request: Request, num_computed_tokens: int) -> bool:
    """Check if the request is in the prefill phase."""
    return (
        request.num_output_tokens == 0
        and num_computed_tokens < request.num_prompt_tokens
    )

@staticmethod
def _remaining_prefill_tokens_with_tokens(
    request: Request, num_computed_tokens: int
) -> int:
    """Get the number of remaining prefill tokens."""
    return max(request.num_prompt_tokens - num_computed_tokens, 0)
```

I'm not sure about these two helper functions. Could you take a look and see if my understanding is correct?
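If it helps, here is how I read the two helpers as a standalone sketch. The module-level functions and the stripped-down `Request` below are illustrative stand-ins, not vLLM's actual classes:

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Stripped-down stand-in for vLLM's Request, keeping only the two
    fields the helpers below need."""

    num_prompt_tokens: int
    num_output_tokens: int


def is_prefill_with_tokens(request: Request, num_computed_tokens: int) -> bool:
    """A request is still prefilling while it has produced no output
    tokens and has not yet computed its whole prompt."""
    return (
        request.num_output_tokens == 0
        and num_computed_tokens < request.num_prompt_tokens
    )


def remaining_prefill_tokens(request: Request, num_computed_tokens: int) -> int:
    """Prompt tokens still to be prefilled (clamped to be non-negative)."""
    return max(request.num_prompt_tokens - num_computed_tokens, 0)


# A 1000-token prompt with 600 tokens computed is mid-prefill with 400
# tokens to go; once any output token exists, the request is decoding.
req = Request(num_prompt_tokens=1000, num_output_tokens=0)
```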
Hi @ppppqp, could you provide some benchmarks, for example under 16k / 32k / 64k ISL scenarios?
I believe we already support this functionality in V1. Specifically, the

Hi! Are you talking about this PR? Requests 1–4 will still get the full budget each run (512 × 4), and Request 5 will be starved.

Yes, this PR will actually enable
@chaunceyjiang
@chaunceyjiang If you specifically need a benchmark on a uniform distribution, I can also run that benchmark and provide the stats!
Signed-off-by: Qiping Pan <panqiping@outlook.com>
@chaunceyjiang
Please let me know if I need to test on
This pull request has merge conflicts that must be resolved before it can be merged.
I personally really like concurrent partial prefills and am looking forward to this PR landing.

Purpose
Implements #14003, since we decided to include it in V1.
Referencing the implementation of the original PR: #10235
In short, the concurrent partial prefill technique limits how much large requests can starve small requests, improving throughput and TTFT overall. There are three key parameters for this strategy:

- `max_num_partial_prefills`: controls how many requests are guaranteed to progress in each scheduler run. At the start of each run, we distribute the token budget evenly across the partial prefill slots, which ensures that at least `max_num_partial_prefills` requests get new tokens prefilled. If a request does not use all of the budget distributed to it, the leftover budget can serve additional requests.
- `max_long_partial_prefills`: controls how many large requests can occupy the partial prefill slots. For example, with `max_num_partial_prefills=4` and `max_long_partial_prefills=2`, each run can have 2 large requests and 2 small requests. The default is 1.
- `long_prefill_token_threshold`: controls the criterion for deciding what counts as a "large request". If the prompt's token count exceeds the threshold, the request is considered large and is limited by `max_long_partial_prefills`.

Test Plan
Unit test with parity to the original PR.
I'm not sure whether we should have parity with this unit test, because based on my testing it seems the alignment is abstracted away from the scheduler. I would need some help here.
https://github.com/vllm-project/vllm/pull/10235/files#diff-2c6af6e25b8d1074f25ef5ad2901121b30bc1528de74d2b3625636fcb8181624R782-R831
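As context for what such a test would assert: the even budget split described in the Purpose section behaves roughly like the sketch below. The function name and the leftover-rollover behavior are my simplifying assumptions, not the PR's actual scheduler code.

```python
def distribute_prefill_budget(
    token_budget: int,
    remaining_tokens: list[int],
    max_num_partial_prefills: int,
) -> list[int]:
    """Evenly split the scheduler token budget across partial-prefill
    slots; budget a request leaves unused rolls over to later requests."""
    num_slots = min(max_num_partial_prefills, len(remaining_tokens))
    if num_slots == 0:
        return []
    slot_budget = token_budget // num_slots
    allocations = []
    leftover = 0
    for tokens in remaining_tokens[:num_slots]:
        available = slot_budget + leftover
        granted = min(tokens, available)
        allocations.append(granted)
        leftover = available - granted
    return allocations


# With a 2048-token budget and 4 slots, each slot gets 512 tokens; the
# 100-token request leaves 412 tokens that roll over to the next slot.
print(distribute_prefill_budget(2048, [4000, 100, 4000, 4000], 4))
# → [512, 100, 924, 512]
```

The key property a test would check is that every granted slot makes progress, while the total grant never exceeds the run's token budget.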
Test Result
Benchmark plan
I followed the setup of the original PR with a custom dataset I generated from the ShareGPT creative writing dataset. The distribution of token counts is shown below, showing three groups of small/medium/large prompts:

I tested three versions on an A40 with the dataset: benchmark-final.jsonl.zip

- `main` branch
- `max_num_partial_prefills=1`
- `max_num_partial_prefills=4` & `long_prefill_token_threshold=2048`

For each version, I also tested with `output_num=128` (the default) and `output_num=1`.
Sorry that the chart is probably not organized in the clearest way. Please compare the stats in the greyed columns together and in the white columns together.

Some interesting observations:

- The TTFT improvement in the `output_len=128` experiment group is not significant (~20%). After some investigation, I think this is because the decoding phase largely averages out TTFT: even after the small requests get prefilled, they still need to sit in the queue for a fairly large number of steps, and are therefore capped by `max_num_seqs`, which is 128 by default. If we only consider the prefill phase (i.e. we set `output_len` to 1), the TTFT improvement is significant (~400%). I ran an extra experiment to further confirm this (the last column of the chart, where `max_num_seqs=1024`).
- As we increase `max_num_partial_prefills` (1 → 4 → 16), throughput increases fairly consistently.

Considerations
Some best practice suggestions around this feature:

- `long_prefill_token_threshold` is best set around `max_num_batched_tokens`. If it is much higher, a single large request can still starve the queue; if it is lower, throughput for large requests degrades quickly, since a large request that is the only request in the queue does not get the full budget of the run.
- `max_num_seqs` must be tuned up in accordance with the throughput improvement.
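For anyone who wants to try the feature, usage would look roughly like the following. The flag names are taken from the original V0 PR #10235 and may differ in the final merged V1 config, so double-check against `vllm serve --help`; the model name is just a placeholder.

```shell
# Allow up to 4 concurrent partial prefills, at most 1 of which may be
# "long" (prompt token count above the 2048 threshold).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-partial-prefills 4 \
  --max-long-partial-prefills 1 \
  --long-prefill-token-threshold 2048
```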