[BugFix] Make PD work with Ray #21072
Conversation
Code Review
This pull request introduces a fix to make distributed KV cache transfer work correctly with Ray. The changes in vllm/v1/executor/ray_distributed_executor.py add logic to aggregate finished_sending and finished_recving statuses from all workers, which is crucial for tracking request completion in a distributed setup. The implementation correctly handles both synchronous and asynchronous execution paths. The corresponding change in vllm/executor/ray_utils.py ensures this information is propagated to the model output. The debugging log statements should be removed from production code.
vllm/executor/ray_utils.py
Outdated
    output = self.worker.model_runner.execute_model(
        scheduler_output, intermediate_tensors)
    logger.info(f"in the ray_utils.py ...")
vllm/executor/ray_utils.py
Outdated
    output = copy.copy(EMPTY_MODEL_RUNNER_OUTPUT)
    output.finished_sending = finished_sending
    output.finished_recving = finished_recving
    logger.info(f"Have successfully set finished_sending: {finished_sending}, finished_recving: {finished_recving}")
    # Block and get results from all workers
    outputs = [ref.get() for ref in refs]
    return self._aggregate_workers_output(outputs)
    else:
    return 2
    return self.parallel_config.pipeline_parallel_size

    def _aggregate_workers_output(
Can we move these to utility functions (shared with multiproc executor), or maybe a mixin superclass also containing the _send_remaining_count and _recv_remaining_count fields?
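One possible shape for the suggested mixin is sketched below. Only the two count-field names (`_send_remaining_count`, `_recv_remaining_count`) come from the comment above; the class name, method names, and bodies are illustrative assumptions, not the merged design.

```python
# Hypothetical sketch of a mixin shared by the Ray and multiproc
# executors. Field names follow the review comment; everything else
# is illustrative.
from typing import Optional


class KVTransferAggregationMixin:

    def _init_kv_counts(self, world_size: int) -> None:
        # Per-request countdowns: how many ranks have not yet reported
        # the request as finished sending / receiving.
        self._send_remaining_count: dict[str, int] = {}
        self._recv_remaining_count: dict[str, int] = {}
        self._world_size = world_size

    def _aggregate(
        self,
        counts: dict[str, int],
        per_worker: list[Optional[set[str]]],
    ) -> Optional[set[str]]:
        # A request counts as globally finished only once every rank
        # has reported it.
        done: set[str] = set()
        for finished in per_worker:
            for req_id in finished or ():
                counts.setdefault(req_id, self._world_size)
                counts[req_id] -= 1
                if counts[req_id] == 0:
                    del counts[req_id]
                    done.add(req_id)
        # Callers expect None, not an empty set, when nothing has
        # fully finished this step.
        return done or None
```

Both executors would then call `_aggregate` on the per-worker `finished_sending` / `finished_recving` sets instead of duplicating the countdown logic.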
    if finished_sending:
        output.finished_sending = finished_sending
    if finished_recving:
        output.finished_recving = finished_recving
See also #21048 which has a fix for this that we should get in.
    hidden_states: torch.Tensor,
    num_scheduled_tokens: int,
    num_scheduled_tokens_np: np.ndarray,
    finished_sending: Optional[set[str]],
Why are all of these changes to gpu_model_runner.py and gpu_worker.py needed? I don't think they affect what's being fixed by this PR and so would be best to revert (unless I'm missing something).
The Ray distributed executor runs the runner directly (not gpu_worker), so it's better for the model_runner to own the logic of filling in the finished_sending and finished_recving fields in the output. I basically reverted the changes to gpu_model_runner introduced in #19555; in that PR, for a reason I don't understand, the logic was moved out of the model_runner and into the worker.
This change was made to support PP.
Notice these lines in GPUModelRunner::execute_model:
if not get_pp_group().is_last_rank:
# For mid-pipeline stages, return the hidden states.
if not broadcast_pp_output:
return hidden_states
so ray distributed executor runs the runner directly
@kouroshHakha could you point to where this is the case? From a quick look it appears that it's still a bit entangled with V0 logic but ultimately uses a RayWorkerWrapper which wraps a WorkerBase which should in this case resolve to a vllm.v1.worker.gpu_worker.Worker.
So here is the flow with Ray as the distributed backend in v1:
v1.RayDistributedExecutor.execute_model() --> RayWrapper.execute_model_ray() (or RayWrapper.execute_model_spmd()) --> worker.model_runner.execute_model()
Therefore the worker.execute_model() logic is skipped and it directly interacts with model_runner.execute_model().
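The call path described above can be sketched with stand-in classes. The class names and bodies below are illustrative, not vLLM's actual implementation; only the call shape (wrapper → runner, bypassing the worker) mirrors the comment.

```python
# Illustrative sketch of the v1 Ray call path described above.
# _Runner / _Worker are stand-ins for vLLM's real classes.
class _Runner:
    def execute_model(self, scheduler_output, intermediate_tensors=None):
        return f"ran {scheduler_output}"


class _Worker:
    def __init__(self):
        self.model_runner = _Runner()


class RayWrapperSketch:
    def __init__(self):
        self.worker = _Worker()

    def execute_model_ray(self, scheduler_output,
                          intermediate_tensors=None):
        # Calls the runner directly, bypassing Worker.execute_model()
        # entirely. Any logic that lives only in the worker (e.g.
        # filling finished_sending / finished_recving) never runs on
        # this path — which is the crux of the bug being discussed.
        return self.worker.model_runner.execute_model(
            scheduler_output, intermediate_tensors)
```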
code for execute_model_ray: https://github.com/vllm-project/vllm/blob/main/vllm/executor/ray_utils.py#L135
@orozery Is there a test / script that I should target to make sure the intended pp logic still works with my changes?
FWIW, there should be no regression for test_pipeline_parallel.py; I'm not sure if there is a test for PP + P/D.
@orozery I see what you are saying. Basically for PP, each pp rank (except the last one) will forward an empty output back to the scheduler, and you want those empty outputs to carry the correct finished_sending/finished_recving attributes. I added that logic back while leaving the main model_runner logic intact. Let's get this merged ASAP so that it can catch the 0.10.0 train, since it's a massive regression in the Ray behavior. We can follow up with better solutions after.
not sure if there is a test for PP + P/D
Based on my research there is none. I also couldn't get master to run PP + P/D, so I'm not sure whether that combo was ever supported and has now regressed, or was never supported at all.
    # When PP is not used, we block here until the result is available.
    if self.max_concurrent_batches == 1:
        return refs[0].get()
    if not self.has_connector:
I think the changes below here in this file should be all that's needed in this PR (apart from moving those other two functions so that they can be shared).
kouroshHakha
left a comment
cc @robertgshaw2-redhat @njhill I think it's ready for review now.
    output = EMPTY_MODEL_RUNNER_OUTPUT

    assert isinstance(output, ModelRunnerOutput)
    if has_kv_transfer_group():
The changes in gpu_model_runner.py and gpu_worker.py are a revert of what was done in #19555.
    @@ -0,0 +1,108 @@
    # SPDX-License-Identifier: Apache-2.0
Adding the tests and bug fix from #21048
That's now been merged to main so can rebase.
ruisearch42
left a comment
ray executor part LGTM, haven't looked at the KV part in detail
This pull request has merge conflicts that must be resolved before it can be merged.
    return "NHD"

    class KVOutputAggregator:
…ix-ray-pd Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
njhill
left a comment
Thanks @kouroshHakha, looks ok to me as an immediate fix but there may be a couple of things we can look at streamlining after (like the mock import thing and some more consolidation of the executor/worker/runner structure when untangling some v0 parts).
@njhill Yes, 100%. I just discussed this with @ruisearch42. He will own the consolidation effort; basically we want to re-architect the Ray implementation so that it inherits the shared logic to the maximum extent possible, and increase the test coverage in a systematic way. We should target those for the next release.
    # set the aggregated finished_sending / finished_recving
    # if output.finished_sending/recving is not empty, but the other ranks
    # still have unfinished send/recv, we want to set the aggregated
    # finished_sending/recving to None until all ranks have finished
Any reason why it's set to None instead of an empty set?
I think it's imposed by higher-level logic. FWIW, this part of the PR inherits the logic that was on master at the time.
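A tiny illustration of the distinction being discussed: to a consumer, `None` means "no update this step", while an empty set would mean "an update arrived, and nothing finished". The function below is a hypothetical consumer, not vLLM code.

```python
# Hypothetical consumer showing why None and set() are not
# interchangeable for the aggregated finished_sending/recving value.
def apply_update(in_flight: set, finished):
    if finished is None:
        return in_flight             # no news from the workers this step
    return in_flight - finished      # drop requests reported finished
```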
…nch (#2122)

### What this PR does / why we need it?
We noticed that vLLM's main branch merged PRs vllm-project/vllm#21072 and vllm-project/vllm#21473 to support the Ray backend and fix some rebase bugs from previous changes. Those changes break disaggregated PD in vLLM Ascend in some scenarios. In this PR, we adopt those changes to make sure the `llmdatddist_c_mgr_connector` works fine on the newest vLLM main branch.

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Relevant UTs will be added to verify the functionality of those changes.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@ad57f23

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: x22x22 <wadeking@qq.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
    output.finished_recving = finished_recving

    # Clear KVConnector state for this step.
    get_kv_transfer_group().clear_connector_metadata()
Why was clear_connector_metadata removed here?
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Fixes #21070
Test Plan
Test Result
(Optional) Documentation Update