
[Core] Concurrent partial prefills for V1#31330

Open
ppppqp wants to merge 16 commits intovllm-project:mainfrom
ppppqp:panqp--partial-prefill

Conversation

@ppppqp
Contributor

@ppppqp ppppqp commented Dec 25, 2025

Purpose

Implements #14003, since we decided to include it in v1.
Referencing the implementation of the original PR: #10235

In short, the concurrent partial prefill technique prevents large requests from starving small requests, improving throughput and TTFT overall. There are three key parameters for this strategy:

  • max_num_partial_prefills: controls how many requests are guaranteed to make progress in each scheduler run. At the start of each run, the token budget is distributed evenly across the partial prefill slots, ensuring that at least max_num_partial_prefills requests get new tokens prefilled. If a request does not use its full share, the leftover budget can be used to serve additional requests.
  • max_long_partial_prefills: controls how many large requests can occupy the partial prefill slots. For example, with max_num_partial_prefills=4 and max_long_partial_prefills=2, each run can have 2 large requests and 2 small requests. The default is 1.
  • long_prefill_token_threshold: controls the criterion for what counts as a "large request". If the number of prompt tokens exceeds the threshold, the request is considered large and is limited by max_long_partial_prefills.
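As a rough illustration of the budget distribution described above (a hypothetical sketch, not the actual vLLM scheduler code; the function name is made up):

```python
def split_prefill_budget(token_budget: int, max_num_partial_prefills: int) -> int:
    """Evenly divide the scheduler's per-run token budget across the
    partial prefill slots, so every request granted a slot is guaranteed
    to make progress this run. Hypothetical helper for illustration."""
    return token_budget // max_num_partial_prefills

# e.g. a 2048-token budget split across 4 slots gives 512 tokens per slot
print(split_prefill_budget(2048, 4))  # 512
```

With max_num_partial_prefills=1 this degenerates to today's behavior: one request gets the whole budget.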

Test Plan

Unit tests with parity to the original PR.
I'm not sure whether we should also maintain parity with this unit test; based on my testing, the alignment logic seems to be abstracted away from the scheduler. Would need some help here.
https://github.com/vllm-project/vllm/pull/10235/files#diff-2c6af6e25b8d1074f25ef5ad2901121b30bc1528de74d2b3625636fcb8181624R782-R831

Test Result

Benchmark plan

I followed the setup of the original PR with a custom dataset I generated from the ShareGPT creative writing dataset. The token-count distribution is shown below, with three groups of small/medium/large prompts:
[image: prompt token-count distribution]

I tested three versions on an A40, using the dataset benchmark-final.jsonl.zip:

  1. main branch
  2. this branch with max_num_partial_prefills=1
  3. this branch with max_num_partial_prefills=4 & long_prefill_token_threshold=2048

For each version, I also tested with output_num=128 (the default) and output_num=1.
vllm serve NousResearch/Hermes-3-Llama-3.1-8B [--max_num_partial_prefills=4 --long_prefill_token_threshold=2048]


vllm bench serve \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name custom \
  --dataset-path benchmark.jsonl \
  --num-prompts -1 \
  --metric-percentiles 80,85,90,95,99 \
  --request-rate 12 \
  --disable_shuffle \
  [--output_len=1]

Sorry that the chart is probably not organized in the clearest way. Please compare the stats in the greyed columns with each other, and those in the white columns with each other.
[image: benchmark results chart]

Some interesting observations:

  1. The TTFT improvement for the output_len=128 group is not significant (~20%). After some investigation, I believe this is because the decoding phase largely averages out the TTFT: even after the small requests are prefilled, they still wait in the queue for a fairly long time, capped by max_num_seqs, which is 128 by default. If we consider only the prefill phase (i.e. set output_len to 1), the TTFT improvement is significant (~400%). I ran an extra experiment to further confirm this (last column of the chart, with max_num_seqs=1024).
  2. As max_num_partial_prefills increases (1 -> 4 -> 16), throughput increases fairly consistently.
  3. There is a tradeoff on TTFT P99: it gets consistently worse as throughput increases.

Considerations

Some best-practice suggestions around this feature:

  1. long_prefill_token_threshold works best when set close to max_num_batched_tokens. If it is much higher, a single large request can still starve the queue. If it is lower, throughput for large requests degrades quickly: if a large request is the only request in the queue, it does not get the full budget of the run.
  2. max_num_seqs must be tuned up in accordance with the throughput improvement.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Dec 25, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements concurrent partial prefills for V1, which is a significant feature for improving throughput and latency. The changes introduce new configuration options and complex scheduling logic. The implementation looks mostly solid, but I've found a critical typo in the configuration validation that would lead to a runtime error.

Comment on lines +119 to +128
class PrefillState:
    """Lightweight state used to reason about a request's prefill status."""

    # whether the request is in the prefill phase
    is_prefill: bool
    # number of remaining tokens to prefill
    remaining_tokens: int
    # whether the prefill is considered a long prefill
    is_long_prefill: bool

Contributor Author


Adding a new dataclass here as an abstraction, in case we want to implement a more sophisticated concurrency strategy in the future (e.g. the strategy for determining what counts as a long prefill).

@mergify

mergify bot commented Dec 25, 2025

Hi @ppppqp, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

            request, request.num_computed_tokens
        )
        if prefill_state.is_prefill:
            num_new_tokens = min(num_new_tokens, partial_prefill_slot_budget)
Contributor Author

@ppppqp ppppqp Dec 25, 2025


Cap the computed num_new_tokens with the allocated budget. This ensures that all requests that are granted a partial prefill slot can progress.

Comment on lines +1105 to +1117
def _is_prefill_with_tokens(request: Request, num_computed_tokens: int) -> bool:
    """Check if the request is in the prefill phase"""
    return (
        request.num_output_tokens == 0
        and num_computed_tokens < request.num_prompt_tokens
    )

@staticmethod
def _remaining_prefill_tokens_with_tokens(
    request: Request, num_computed_tokens: int
) -> int:
    """Get the number of remaining prefill tokens"""
    return max(request.num_prompt_tokens - num_computed_tokens, 0)
Contributor Author


Not sure about these two helper functions. Could you take a look and see if my understanding is correct?

@ppppqp ppppqp force-pushed the panqp--partial-prefill branch from a8b17dc to bb38145 Compare December 25, 2025 07:05
@mergify

mergify bot commented Dec 25, 2025

(Same pre-commit failure notice as above.)

@ppppqp ppppqp force-pushed the panqp--partial-prefill branch from dc6c074 to e5204f9 Compare December 25, 2025 07:29
@ppppqp ppppqp marked this pull request as draft December 25, 2025 07:30
@chaunceyjiang
Collaborator

Hi @ppppqp, could you provide some benchmarks, for example under 16k / 32k / 64k ISL scenarios?

@ppppqp ppppqp force-pushed the panqp--partial-prefill branch from 310c3c4 to 57f2517 Compare December 27, 2025 00:24
@robertgshaw2-redhat
Collaborator

I believe we already support this functionality in V1. Specifically, the --long_prefill_token_threshold.

@ppppqp
Contributor Author

ppppqp commented Dec 27, 2025

I believe we already support this functionality in V1. Specifically, the --long_prefill_token_threshold.

Hi! Are you talking about this PR?
If so, I believe it only added the parameter but does not fully resolve the throughput issue. Consider this case:
The budget for each run is 2048. If we set the threshold greater than 2048, the threshold does nothing. If we set the threshold lower than 2048, say 512, and requests arrive in this order:

Request 1 (100k tokens)
Request 2 (100k tokens)
Request 3 (100k tokens)
Request 4 (100k tokens)
Request 5 (5 tokens)

then Requests 1-4 will still consume the entire budget each run (512 * 4), and Request 5 will be starved.
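As a toy illustration of that scenario (a hypothetical first-come-first-served simulation, not vLLM code):

```python
def schedule_run(queue, token_budget, threshold):
    """One FCFS scheduler pass: each request is capped at `threshold`
    tokens but otherwise served in arrival order until the per-run
    token budget is exhausted. Hypothetical sketch for illustration."""
    scheduled = {}
    for name, remaining in queue:
        if token_budget == 0:
            break
        grant = min(remaining, threshold, token_budget)
        scheduled[name] = grant
        token_budget -= grant
    return scheduled

queue = [("req1", 100_000), ("req2", 100_000),
         ("req3", 100_000), ("req4", 100_000), ("req5", 5)]
# With budget 2048 and threshold 512, the four long requests take
# 512 tokens each and req5 never gets scheduled.
print(schedule_run(queue, 2048, 512))
```

Running this shows req5 absent from the scheduled set on every pass until the long requests finish, which is exactly the starvation the slot limit in this PR is meant to prevent.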

@ppppqp
Contributor Author

ppppqp commented Dec 27, 2025

Yes agreed - so the updated semantics would then limit the number of long prefills running at once?

Yes, this PR actually enables max_num_partial_prefills and max_long_partial_prefills, which guarantee that at least max_num_partial_prefills - max_long_partial_prefills short requests are protected from starvation.
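Put differently, the invariant can be sketched as (hypothetical helper, not vLLM code):

```python
def reserved_short_slots(max_num_partial_prefills: int,
                         max_long_partial_prefills: int) -> int:
    """Slots that long prefills can never occupy, so at least this many
    short requests can make progress in each scheduler run."""
    return max_num_partial_prefills - max_long_partial_prefills

# e.g. 4 slots with at most 2 long prefills leaves 2 slots for short requests
print(reserved_short_slots(4, 2))  # 2
```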

@ppppqp ppppqp changed the title [WIP][Core] Concurrent partial prefills for V1 [Core] Concurrent partial prefills for V1 Dec 27, 2025
@mergify

mergify bot commented Dec 27, 2025

(Same pre-commit failure notice as above.)

1 similar comment

@ppppqp
Contributor Author

ppppqp commented Dec 27, 2025

Hi @ppppqp, could you provide some benchmarks, for example under 16k / 32k / 64k ISL scenarios?

@chaunceyjiang
Hi! Do you mean tuning input_len to these values, or should I actually generate a dataset with input prompts of those lengths? I'm not sure whether the benchmark I've done is sufficient or whether extra effort is needed. Thanks!

@ppppqp ppppqp force-pushed the panqp--partial-prefill branch from fee590d to e3e3e0d Compare December 28, 2025 00:22
@chaunceyjiang
Collaborator

Hi! Do you mean tuning input_len to these values, or should I actually generate a dataset with input prompts of those lengths? I'm not sure whether the benchmark I've done is sufficient or whether extra effort is needed. Thanks!

i.e.

vllm bench serve --model XXXX --random-input-len 2048 --random-output-len 1024  --max-concurrency 120 --num-prompts 480 --port  8990

@ppppqp
Contributor Author

ppppqp commented Jan 1, 2026

Hi! Do you mean tuning input_len to these values, or should I actually generate a dataset with input prompts of those lengths? I'm not sure whether the benchmark I've done is sufficient or whether extra effort is needed. Thanks!

i.e.

vllm bench serve --model XXXX --random-input-len 2048 --random-output-len 1024  --max-concurrency 120 --num-prompts 480 --port  8990

@chaunceyjiang
Actually, I don't think this makes sense. I tested it, and if I specify --random-input-len 2048, all inputs have length 2047, so both cases should perform the same (with a uniform distribution, partial prefilling should not observe any long request).
Let me know if there's anything else I need to benchmark 🙏.

If you intentionally want this uniform-distribution benchmark, I can also run it and provide the stats!

ppppqp added 14 commits January 6, 2026 19:36
Signed-off-by: Qiping Pan <panqiping@outlook.com>
@ppppqp ppppqp force-pushed the panqp--partial-prefill branch from e3e3e0d to 7925874 Compare January 7, 2026 03:37
ppppqp added 2 commits January 6, 2026 19:59
Signed-off-by: Qiping Pan <panqiping@outlook.com>
@ppppqp
Contributor Author

ppppqp commented Jan 7, 2026

@chaunceyjiang
Hi! This is the benchmark result using

vllm serve NousResearch/Hermes-3-Llama-3.1-8B [--max_num_partial_prefills=4]

vllm bench serve --model NousResearch/Hermes-3-Llama-3.1-8B --random-input-len 2048 --random-output-len 1024  --max-concurrency 120 --num-prompts 480
  • Comparing the main branch and this branch with max_num_partial_prefills=1, the change in this PR causes no TTFT regression.
  • Comparing the main branch and this branch with max_num_partial_prefills=4, the change in this PR causes about a 1% TTFT slowdown for uniformly distributed input.
[image: benchmark results for uniform input lengths]

Please let me know if I need to test with --random-input-len 16384/32768/65536. The benchmark would take significantly longer to run, but I'm happy to do that as well if needed!

@mergify

mergify bot commented Jan 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ppppqp.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 12, 2026
@noooop
Collaborator

noooop commented Jan 24, 2026

I personally really like Concurrent partial prefills and am looking forward to this PR landing.

