[Core] Pipeline Parallel support for Model Runner V2 #33960
WoosukKwon merged 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces pipeline parallelism (PP) support for the Model Runner V2 by adding a new PPHandler class. This is a great approach as it encapsulates the PP-related logic, keeping the model runner code cleaner and more maintainable. The changes in GPUModelRunner correctly integrate this handler for different stages of model execution, such as dummy runs, model execution, and token sampling. The distinction between pipeline stages (first, middle, last) is handled well throughout the changes. The current implementation uses blocking communication, which is a reasonable first step, with non-blocking communication planned for future work.
I have one suggestion to improve the robustness of the PPHandler by adding a more explicit check for expected tensor keys, which will improve the developer experience for those implementing PP support in new models.
vllm/v1/worker/gpu/pp_handler.py
Outdated
```python
if self.produces_final_output:
    # Last rank: extract hidden states for sampling
    if isinstance(hidden_states, IntermediateTensors):
        return hidden_states["hidden_states"]
```
Accessing hidden_states["hidden_states"] directly can lead to an uninformative KeyError if a model's forward method on the last pipeline stage returns an IntermediateTensors object that doesn't contain the "hidden_states" key. To improve robustness and provide a clearer error message for model developers, it's better to check for the key's existence and raise a ValueError with an explanatory message if it's missing.
```diff
-        return hidden_states["hidden_states"]
+        if "hidden_states" not in hidden_states.tensors:
+            raise ValueError(
+                "IntermediateTensors from model on the last PP rank must "
+                "contain 'hidden_states' tensor.")
+        return hidden_states["hidden_states"]
```
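For illustration, here is the suggested check as runnable code. The `IntermediateTensors` stub below is a minimal stand-in for vLLM's real class (which lives in `vllm.sequence`), kept only so the sketch is self-contained:

```python
class IntermediateTensors:
    """Minimal stand-in for vLLM's IntermediateTensors (illustrative only)."""

    def __init__(self, tensors):
        self.tensors = tensors

    def __getitem__(self, key):
        return self.tensors[key]


def extract_hidden_states(hidden_states):
    # Fail with an explanatory ValueError instead of a bare KeyError,
    # so model developers adding PP support get an actionable message.
    if "hidden_states" not in hidden_states.tensors:
        raise ValueError(
            "IntermediateTensors from model on the last PP rank must "
            "contain 'hidden_states' tensor.")
    return hidden_states["hidden_states"]
```
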
🚀🚀🚀
Force-pushed from b8c0f49 to c863e21
Force-pushed from e169ff1 to 5afb5ae
WoosukKwon left a comment
@ZhanqiuHu Thanks for the PR! I like the design of using PP handler to encapsulate the PP-related logics.
That said, I think we could improve the model runner code and make the change even smaller. Please check out my comments.
vllm/v1/worker/gpu/model_runner.py
Outdated
```python
# PP input preparation: handler centralizes all PP input logic.
model_inputs = self.pp_handler.prepare_model_inputs(
    input_batch.input_ids,
    positions,
    input_batch.inputs_embeds,
    intermediate_tensors,
)
```
Can we keep the model inputs explicit? I’m not a fan of encapsulating them into a separate object. For example, when supporting CUDA graphs, it’s critical to reason about all inputs and ensure they use consistent memory addresses. This abstraction makes that harder to see.
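The CUDA graph concern can be illustrated with a hypothetical buffer-reuse sketch, using plain Python lists as stand-ins for GPU tensors (the names here are illustrative, not vLLM's actual API):

```python
# A captured CUDA graph replays reads and writes against fixed memory
# addresses, so per-step inputs must be *copied into* preallocated
# buffers, never rebound to freshly allocated objects.
MAX_TOKENS = 8
input_buffer = [0] * MAX_TOKENS  # allocated once, before "capture"

def write_step_inputs(step_inputs):
    # In-place slice assignment mutates the existing buffer; the object
    # (and, for a real GPU tensor, its device address) stays the same.
    n = len(step_inputs)
    input_buffer[:n] = step_inputs
    return n
```

Hiding the inputs inside an opaque `model_inputs` object makes it harder to audit that every tensor follows this copy-into-place discipline.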
vllm/v1/worker/gpu/model_runner.py
Outdated
```python
self.execute_model_state = hidden_states, input_batch, kv_connector_output
output = self.pp_handler.prepare_output(hidden_states, kv_connector_output)

if isinstance(output, IntermediateTensors):
```
Can we just check the PP rank directly?
vllm/v1/worker/gpu/model_runner.py
Outdated
```python
# Broadcast to non-last ranks (handles spec decode multi-token)
self.pp_handler.maybe_broadcast_sampled_tokens(
    sampler_output, num_sampled, num_rejected
)
```
I think it'd be nice if we can do something like
```diff
-# Broadcast to non-last ranks (handles spec decode multi-token)
-self.pp_handler.maybe_broadcast_sampled_tokens(
-    sampler_output, num_sampled, num_rejected
-)
+if self.use_pp:
+    # Broadcast to non-last ranks (handles spec decode multi-token)
+    self.pp_handler.maybe_broadcast_sampled_tokens(
+        sampler_output, num_sampled, num_rejected
+    )
```
…code readiness Co-authored with @yewentao256 Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Force-pushed from 5afb5ae to ef1f640
Hi @ZhanqiuHu, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Hi @WoosukKwon, Thanks for the review! Just pushed the updates: Changes:
Note:
Force-pushed from ef1f640 to b872030
vllm/v1/worker/gpu/model_runner.py
Outdated
```python
# NOTE: In PP mode, every rank must construct the *exact* same request
# ordering for the batched token dimension. Python's `sorted(..., key=...)`
# is stable, so ties would otherwise be broken by the input dict's
# insertion order, which can differ across processes. Use `req_id` as a
# deterministic tie-breaker to keep PP stages in sync.
req_ids = sorted(
    num_tokens_per_req,
    key=lambda req_id: (num_tokens_per_req[req_id], req_id),
)
```
I don't think this should be needed. The num_tokens_per_req dict received by each rank should be identical, and sorted is stable.
I found that this sort is actually quite a bit faster without the lambda.
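The stability argument can be checked directly: `sorted` preserves the relative order of equal keys, so the two variants differ only in how ties are broken. A quick illustration with a hypothetical dict (values are per-request token counts):

```python
# Two requests tie on token count; "req-b" was inserted first.
num_tokens_per_req = {"req-b": 3, "req-a": 3, "req-c": 1}

# Value-keyed sort without a lambda: ties keep dict insertion order,
# because Python's sort is stable.
fast = sorted(num_tokens_per_req, key=num_tokens_per_req.__getitem__)

# Explicit (value, req_id) tie-breaker from the diff: ties are
# broken lexicographically by req_id instead.
tie_broken = sorted(
    num_tokens_per_req,
    key=lambda req_id: (num_tokens_per_req[req_id], req_id),
)
```

If every rank receives an identically ordered `num_tokens_per_req`, the faster `fast` variant is already deterministic across ranks; the `req_id` tie-breaker only matters if insertion order could diverge.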
Changes:
1. Expose use_pp checking and PP rank checking explicitly in model runner. Only create pp_handler when PP is enabled.
2. Removed unused code in pp_handler.
3. Make model inputs and outputs processing explicit in model runner.

Note:
1. Right now, pp_handler class doesn't hold state, just holds methods. Technically no class is needed. Maybe in the future we might need to hold state for async, but not sure.

Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Force-pushed from b872030 to faeec31
```python
if self.use_pp and not get_pp_group().is_first_rank:
    # Non-first PP rank: forward with intermediate tensors.
    assert intermediate_tensors is not None
    hidden_states = self.model(
        input_ids=None,
        positions=positions,
        inputs_embeds=None,
        intermediate_tensors=intermediate_tensors,
    )
else:
    hidden_states = self.model(
        input_ids=input_batch.input_ids,
        positions=positions,
        inputs_embeds=input_batch.inputs_embeds,
    )
```
suggested:
```diff
-if self.use_pp and not get_pp_group().is_first_rank:
-    # Non-first PP rank: forward with intermediate tensors.
-    assert intermediate_tensors is not None
-    hidden_states = self.model(
-        input_ids=None,
-        positions=positions,
-        inputs_embeds=None,
-        intermediate_tensors=intermediate_tensors,
-    )
-else:
-    hidden_states = self.model(
-        input_ids=input_batch.input_ids,
-        positions=positions,
-        inputs_embeds=input_batch.inputs_embeds,
-    )
+if get_pp_group().is_first_rank:
+    input_ids = input_batch.input_ids
+    inputs_embeds = input_batch.inputs_embeds
+else:
+    # Non-first PP rank: forward with intermediate tensors.
+    input_ids, inputs_embeds = None, None
+    assert intermediate_tensors is not None
+hidden_states = self.model(
+    input_ids=input_ids,
+    positions=positions,
+    inputs_embeds=inputs_embeds,
+    intermediate_tensors=intermediate_tensors,
+)
```
@ZhanqiuHu do you know the reason for the performance drop relative to V1? Is it cudagraphs?
I benchmarked both v1 and v2 with … but I am looking deeper into the issue.
Hi @njhill, I was benchmarking V1 vs V2 PP performance and noticed that in previous runs I didn't disable prefix caching, and results between benchmark runs actually varied by a lot. I now added …
Baseline with no PP for reference (details collapsed)
I think the performance results are comparable between V1 and V2. I also checked the flow regarding PP between V1 and V2 and it should be the same, except that now that …
WoosukKwon left a comment
@ZhanqiuHu LGTM! Thanks for addressing all comments. I'm excited that we have such a clean implementation of PP. Great work 👍
Summary
Co-authored with @yewentao256
Add Pipeline Parallel (PP) support to Model Runner V2 (`vllm/v1/worker/gpu/model_runner.py`). This introduces a modular `PPHandler` class that encapsulates all PP logic, keeping the model runner code clean. Verified correct output and competitive throughput against the V1 baseline.

Related: #32455 (Q1 2026 Roadmap), where PP is listed as a missing feature for Model Runner V2.
Changes
In Model Runner (`vllm/v1/worker/gpu/model_runner.py`)

- `execute_model`:
  - First rank: run the model forward, send `IntermediateTensors` to the next stage.
  - Middle ranks: receive `IntermediateTensors` from the previous stage, run the model forward, send `IntermediateTensors` to the next stage.
  - Last rank: receive `IntermediateTensors` from the previous stage, run the model forward, store hidden states for sampling.
- `sample_tokens`:
  - Last rank: broadcast `sampled_token_ids` / `num_sampled` / `num_rejected` to all other ranks via `PPHandler`, then return `ModelRunnerOutput`.
  - Other ranks: run `postprocess` to update local state, return `None`.
- `_dummy_run`: create dummy intermediate tensors for non-first ranks; skip the sampler for non-last ranks.
- `capture_model`: skip CUDA graph capture when PP is enabled (eager-only for now).
- Deterministic request sorting: use `req_id` as tie-breaker to ensure consistent batch ordering across PP ranks.
- Multimodal guard: only prepare MM embeddings on the first PP rank.
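The first/middle/last flow described above can be sketched as a plain-Python simulation. The hand-off between loop iterations stands in for the real blocking device-to-device send/recv, and all names are illustrative rather than vLLM's actual API:

```python
def run_stage(rank, world_size, recv, layer):
    """Simulate one PP stage: receive -> forward -> send or keep."""
    is_first = rank == 0
    is_last = rank == world_size - 1
    # First rank starts from raw inputs; others consume the previous
    # stage's intermediate tensors.
    x = [0.0] if is_first else recv
    # Each stage owns a contiguous slice of the model's layers.
    hidden = [layer(v) for v in x]
    if is_last:
        return ("sampled", hidden)        # keep hidden states for sampling
    return ("intermediate", hidden)       # forward to the next stage

# Sequential loop emulates three ranks with blocking hand-offs.
world_size = 3
payload = None
for rank in range(world_size):
    kind, payload = run_stage(rank, world_size, payload, lambda v: v + 1.0)
```
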
New Class: PPHandler (`vllm/v1/worker/gpu/pp_handler.py`)

New module with a `PPHandler` class. All public methods no-op when PP is disabled or called on an inapplicable rank, so callers don't need guard conditions.

- `maybe_broadcast_sampled_tokens`: broadcast `sampled_token_ids`, `num_sampled`, `num_rejected` to all PP ranks; no-ops on non-last ranks.
- `maybe_receive_sampled_tokens`: returns `None` on the last rank. Supports variable `max_sample_len` for future speculative decoding + PP.
- `prepare_model_inputs`: build `model.forward()` kwargs (raw inputs for the first rank, intermediate tensors for the others).
- `prepare_output`: return `IntermediateTensors` (non-last ranks).
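As a rough sketch of the rank-role bookkeeping this section describes, assuming a `from_rank` constructor and field names that are not necessarily the PR's actual API:

```python
from dataclasses import dataclass
from enum import Enum


class PPRole(Enum):
    NO_PP = "no_pp"
    FIRST = "first"
    MIDDLE = "middle"
    LAST = "last"


@dataclass(frozen=True)
class PPConfig:
    role: PPRole
    size: int   # number of pipeline stages
    index: int  # this rank's position in the pipeline

    @classmethod
    def from_rank(cls, rank, size):
        # Derive the role once at init; handler methods can then no-op
        # cheaply on ranks where they don't apply.
        if size <= 1:
            role = PPRole.NO_PP
        elif rank == 0:
            role = PPRole.FIRST
        elif rank == size - 1:
            role = PPRole.LAST
        else:
            role = PPRole.MIDDLE
        return cls(role=role, size=size, index=rank)
```
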
PPConfig(dataclass with rank role, size, index) andPPRoleenum (NO_PP,FIRST,MIDDLE,LAST).Future Work
Test Plan
Accuracy — lm_eval gsm8k (5-shot):
Throughput — decode-heavy random workload:
```shell
vllm bench serve --model $MODEL --dataset-name random --host 127.0.0.1 \
  --random-input-len 2 --random-output-len 512 --num-prompts 128 \
  --port 9256 --num-warmups 16
```

Test Results
Accuracy (gsm8k, PP=2, Qwen3-30B-A3B MoE FP8)
V2 matches V1 accuracy. The 0.0007 difference in strict-match is within noise (±0.0114 stderr).
Throughput (PP=2, Qwen3-30B-A3B MoE FP8, input=2, output=512, 128 prompts)
Note: Claude (Anthropic) was used as a coding assistant during development of this PR.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.