Merged
Contributor

With examples/experimental/simple_offline_bench.py, I have observed that …
rebel-jaehwang approved these changes on Jan 13, 2026.
rebel-jaehwang added a commit that referenced this pull request on Jan 30, 2026:

* fix: limit decode bs to (max num seqs // pp size)
* tmp: pad decode inputs to max_num_seqs // pp_size
* add: simple offline benchmark script
* refac: consolidate self.max_batch_size and decode_max_batch_size
* fix: clearer perf report

Co-authored-by: Jaehwang Jung <jaehwang.jung@rebellions.ai>
rebel-jiwoopark pushed a commit that referenced this pull request on Feb 4, 2026 (same commit message as above).
🚀 Summary of Changes
In upstream vLLM v1, pipeline parallelism (PP) was implemented in a subsequent PR. At a high level, the scheduler divides up to max_num_seqs requests into pp_size groups, and requests within each group are batched and executed together. Since there are pp_size such batches, they are scheduled back-to-back so that the pipeline runs without bubbles.
One important detail is how requests are assigned to groups. Requests are not explicitly grouped in a round-robin or load-balanced manner. Instead, a request is implicitly assigned to a group based on which batch happens to be scheduled at the time the request is first admitted. Once a request is scheduled and executes a single inference step, it must pass through all pipeline stages before it can be scheduled again. As a result, requests that were initially batched together naturally continue to remain in the same group across subsequent iterations.
Because of this behavior, group sizes can become uneven. In the worst case it is possible for all requests to end up in a single group while the remaining groups are empty, effectively reducing pipeline utilization to 1 / pp_size. However, this worst-case behavior is unlikely in practice. As long as max_num_seqs is configured sufficiently large and there are enough concurrent requests to keep the pipeline filled, you can generally expect meaningful throughput improvements from PP.
If this is still unclear, please take a close look at the code below and trace through the execution flow.
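The implicit grouping described above can be sketched with a toy simulation. This is illustrative Python only, not vLLM's scheduler; `PP_SIZE`, `MAX_NUM_SEQS`, `step`, and the data structures are made-up names. The key behavior it models: a request is folded into whichever batch happens to be scheduled when it is first admitted, and it then stays with that batch, because after one inference step it must traverse all pipeline stages before it is schedulable again.

```python
# Toy simulation (not vLLM code) of implicit, "sticky" group assignment
# under pipeline parallelism.

from collections import deque

PP_SIZE = 2          # hypothetical pipeline-parallel degree
MAX_NUM_SEQS = 8     # hypothetical scheduler limit

groups = [[] for _ in range(PP_SIZE)]   # one request group per pipeline slot
waiting = deque()                       # requests not yet admitted

def step(slot, admitted):
    """One scheduling step for pipeline slot `slot`. Newly admitted
    requests are folded into whichever group is being scheduled *now*,
    and stay there on subsequent iterations."""
    waiting.extend(admitted)
    while waiting and sum(len(g) for g in groups) < MAX_NUM_SEQS:
        groups[slot].append(waiting.popleft())  # implicit, sticky assignment
    return list(groups[slot])

# Worst case: all requests arrive while slot 0 is being scheduled, so they
# all land in group 0 and the other group stays empty.
step(0, ["r0", "r1", "r2", "r3"])
step(1, [])
print([len(g) for g in groups])  # -> [4, 0]
```

With `PP_SIZE = 2`, the resulting `[4, 0]` split means one pipeline slot does all the work while the other idles, i.e. utilization drops toward 1 / pp_size, exactly the degenerate case discussed above.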
Unfortunately, this is not the case for our vllm-rbln. Since we do not support mixed batching and we also force the prefill batch size to 1, decode requests cannot be scheduled while a prefill is in progress; they remain runnable but effectively blocked until the prefill finishes. Eventually, prefill completes (typically because the running set reaches max_num_seqs or the KV cache can no longer accommodate new requests), and we move on to scheduling decode. At that point, all pending decode requests become schedulable at the same time, and we end up packing them into a single batch. As a result, the "worst case" described above for the GPU side becomes essentially inevitable for us, which is why PP utilization collapses and performance can degrade severely.

In the long run, the right solution is clearly to lift the prefill batch-size constraint and allow prefill and decode to be scheduled in the same batch; if we could do that, we wouldn't need to be having this discussion in the first place. Unfortunately, that does not seem feasible in the near term. So, as a temporary workaround, this PR introduces the following changes (see the commit messages above):

* Limit the decode batch size to max_num_seqs // pp_size, so that a backlog of decode requests is spread across pp_size batches instead of collapsing into one.
* Pad decode inputs to max_num_seqs // pp_size (temporary measure).
* Consolidate self.max_batch_size and decode_max_batch_size.
* Add a simple offline benchmark script (examples/experimental/simple_offline_bench.py).
* Make the performance report clearer.
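As a rough illustration of the decode-side cap (a hypothetical helper, not the actual patch; the function name and signature are invented), limiting each decode batch to max_num_seqs // pp_size forces a decode backlog to be split across pp_size back-to-back batches, one per pipeline slot, with padding to a fixed size:

```python
# Hypothetical sketch of the workaround, not the actual vllm-rbln code:
# cap each decode batch at max_num_seqs // pp_size so that a backlog of
# decode requests is spread over pp_size batches, and pad each batch to
# that fixed size (e.g. for a statically compiled graph).

def split_decode_batches(decode_reqs, max_num_seqs, pp_size, pad_token=None):
    cap = max_num_seqs // pp_size          # the decode batch-size cap
    batches = []
    for i in range(0, len(decode_reqs), cap):
        batch = decode_reqs[i:i + cap]
        batch += [pad_token] * (cap - len(batch))   # pad to fixed size
        batches.append(batch)
    return batches

# 8 pending decodes, max_num_seqs=8, pp_size=2 -> two batches of 4,
# one per pipeline slot, instead of a single batch of 8 in one slot.
print(split_decode_batches(list(range(8)), 8, 2))  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The padding mirrors the "tmp: pad decode inputs" commit: partially filled batches are padded up to the cap rather than run at a dynamic size.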
📌 Related Issues / Tickets
✅ Type of Change
* feature
* model
* core
* bug-fix
* perf
* refactor
* docs
* other: please describe

🧪 How to Test
python examples/experimental/simple_offline_bench.py
python examples/experimental/simple_offline_bench.py --pipeline-parallel-size 2
...

📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes