[Feature] Support Pipeline Parallelism in torchrun SPMD offline inference for V1 #17827
simon-mo merged 10 commits into vllm-project:main from
Conversation
comaniac
left a comment
Overall LGTM. The main question: I'm not sure why we need pipeline_parallel_broadcast_output instead of making it always true.
youkaichao
left a comment
For efficient PP, we would want different pipeline stages to run different batches concurrently. Would this broadcast serialize them, so that only one batch and one stage are running across the full engine?
vllm/v1/worker/gpu_model_runner.py
Outdated
Should we return earlier if it's not the first rank?
We still need it for a complete sync in the current solution.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucia Fang <fanglu@fb.com>
Thanks @youkaichao. Right now torchrun runs in a purely synced way to align with SPMD, so this is not the ideal solution for efficient PP: only one batch and one stage are active at a time. As a follow-up, I will think about how to design and improve it so that multiple batches can overlap while staying compatible with torchrun.
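The serialization concern above can be made concrete with a toy cost model (the numbers and the `total_time` helper are illustrative, not measured from vLLM): a fully synced pipeline pays `num_batches * num_stages` stage-times, while an overlapped microbatch schedule pays only the classic fill/drain cost.

```python
def total_time(num_stages: int, num_batches: int, overlapped: bool,
               t_stage: float = 1.0) -> float:
    """Toy cost model for a pipeline with equal per-stage time t_stage.

    Fully serialized (one batch, one stage active at a time, as in the
    current synced solution): num_batches * num_stages * t_stage.
    Overlapped microbatches (classic pipeline fill/drain):
    (num_batches + num_stages - 1) * t_stage.
    """
    if overlapped:
        return (num_batches + num_stages - 1) * t_stage
    return num_batches * num_stages * t_stage


# With 4 stages and 8 microbatches:
serialized = total_time(4, 8, overlapped=False)   # 32.0 stage-times
overlapped = total_time(4, 8, overlapped=True)    # 11.0 stage-times
print(serialized, overlapped)
```

This is why overlapping microbatches is the natural follow-up: the gap between the two schedules grows with both the batch count and the stage count.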
houseroad
left a comment
As discussed with @luccafong, landing this first to unblock some use cases.
Will enhance the perf in a follow-up PR.
Signed-off-by: Lucia Fang <fanglu@fb.com>
ruisearch42
left a comment
LG with a couple of comments.
last_rank_in_group = pp_group_ranks - 1
if self.parallel_config.distributed_executor_backend \
        == "external_launcher" and len(get_pp_group().ranks) > 0:
    model_output_broadcast_data = get_pp_group().broadcast_tensor_dict(
Could you explain why we need this broadcast?
Yes, added as comments. For now we enable it by syncing all ranks; we will improve this to reduce PP bubbles in a follow-up PR.
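The broadcast under discussion can be illustrated with a minimal, framework-free sketch. The `ToyPPGroup` class and the mailbox-style rank simulation below are hypothetical stand-ins (not vLLM's or torch.distributed's API); they only show why, in SPMD external_launcher mode, the last pipeline stage must broadcast its output so every rank returns the same result.

```python
# In SPMD-style external_launcher mode, every rank runs the same driver
# code and must return the same output, but only the last pipeline stage
# actually computes it. Hence the broadcast from the last PP rank.

class ToyPPGroup:
    """Hypothetical stand-in for a pipeline-parallel process group."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self._mailbox = None  # simulates the collective's shared channel

    def broadcast_tensor_dict(self, data, src):
        # Rank `src` publishes its output; every rank reads the same value.
        if data is not None:
            self._mailbox = data
        return self._mailbox


def run_rank(rank: int, group: ToyPPGroup):
    last_rank = group.world_size - 1
    # Only the last stage produces a real model output.
    local_output = {"tokens": [1, 2, 3]} if rank == last_rank else None
    # All ranks participate in the broadcast, so all return the same dict.
    return group.broadcast_tensor_dict(local_output, src=last_rank)


group = ToyPPGroup(world_size=4)
# Run the src rank first so the mailbox is populated (a real collective
# synchronizes this for us).
outputs = {3: run_rank(3, group)}
for r in range(3):
    outputs[r] = run_rank(r, group)

assert all(outputs[r] == {"tokens": [1, 2, 3]} for r in range(4))
print("all ranks agree:", outputs[0])
```

The cost of this scheme is the full-sync behavior discussed above: every rank waits on the broadcast, which is what the follow-up PR aims to relax.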
Signed-off-by: Lucia Fang <fanglu@fb.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucia Fang <fanglu@fb.com>
CI build failure is not related; also pulled the latest changes. @houseroad
Actually the v1 test failure may be related: E AssertionError: pipeline model parallel group is not initialized. cc: @luccafong
This is a trunk failure. Added a fix here:
…ce for V1 (vllm-project#17827) Signed-off-by: Lucia Fang <fanglu@fb.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
This PR adds support for PP in torchrun offline inference. Note that it does not yet support overlapping microbatches, so it is not the most efficient way to unblock PP use cases; will improve in follow-ups.
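For context, torchrun SPMD offline inference would be launched with every rank running the same script; the script name and rank count below are illustrative (check the vLLM offline-inference examples for the exact invocation):

```shell
# Hypothetical launch sketch: torchrun spawns one process per rank on a
# node, and each process runs the identical SPMD driver script that
# constructs the engine with the external_launcher backend.
torchrun --nproc-per-node=4 offline_inference_torchrun.py
```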