[TPU] Use Ray for default distributed backend #8389

Merged · 2 commits into main · Sep 12, 2024

Conversation

WoosukKwon (Collaborator)

No description provided.

WoosukKwon added the tpu (Related to Google TPUs) label on Sep 12, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

njhill (Member) commented on Sep 12, 2024

@WoosukKwon I'm curious about the reason not to use the multiprocessing distributed backend for this?

WoosukKwon (Collaborator, Author)

@njhill Good question. Actually, the MP backend would also work for TPUs. However, I think users such as GKE prefer Ray because (1) they are interested in multi-host inference (which TPUs are quite good at), and (2) they are already familiar with Ray.
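
To make the user-facing effect concrete, here is a minimal usage sketch (not part of this PR): with this change, a multi-chip TPU run picks Ray automatically, so passing distributed_executor_backend="ray" explicitly becomes optional. The model name and parallel size below are placeholders, and it is assumed that LLM forwards distributed_executor_backend to the engine arguments like other engine options.

```python
from vllm import LLM, SamplingParams

# Placeholder model and parallel size; a 4-way tensor-parallel run on a TPU slice.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",   # placeholder model name
    tensor_parallel_size=4,
    distributed_executor_backend="ray",   # now the default on multi-chip TPU
)

outputs = llm.generate(["Hello, TPU!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```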

@@ -869,6 +869,13 @@ def __init__(
                 f"distributed executor backend "
                 f"'{self.distributed_executor_backend}'.")

+        if current_platform.is_tpu() and self.world_size > 1:
+            if self.distributed_executor_backend is None:
+                self.distributed_executor_backend = "ray"
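
For readers skimming the diff, here is a self-contained sketch of the check this hunk introduces, under the assumption (based on the discussion below) that explicitly requesting a non-Ray backend on multi-chip TPU raises NotImplementedError; the function name and error message are illustrative, not copied from the PR.

```python
from typing import Optional

def resolve_tpu_backend(is_tpu: bool, world_size: int,
                        backend: Optional[str]) -> Optional[str]:
    """Default to Ray on multi-chip TPU and reject unsupported backends (sketch)."""
    if is_tpu and world_size > 1:
        if backend is None:
            backend = "ray"  # Ray becomes the implicit default
        if backend != "ray":
            # Only a Ray executor exists for TPU, so fail fast on e.g. "mp"
            # instead of hitting a bare assert later. (Assumed error type.)
            raise NotImplementedError(
                f"Backend {backend!r} is not supported on TPU; use 'ray'.")
    return backend

assert resolve_tpu_backend(True, 8, None) == "ray"  # default picked on multi-chip TPU
```
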
Member

I mean, I think this line alone should be enough to change the default backend to Ray in the TPU case.

WoosukKwon (Collaborator, Author) commented on Sep 12, 2024

Oh, the error is for those who use distributed_executor_backend="mp".

Member

Why do we need to raise an error if users explicitly specify the MP backend?

WoosukKwon (Collaborator, Author)

The MP backend is not supported for TPUs at the moment. Without this line, the user will get the error:

"/vllm/engine/llm_engine.py", line 505, in _get_executor_cls
    assert distributed_executor_backend is None
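
For context, here is a simplified illustration (not vLLM's actual code) of the executor selection the traceback points at: on TPU only a Ray executor and a single-process executor exist, so an explicit "mp" reaches the bare assert unless the config check added in this PR rejects it earlier with a clearer message. The executor names below are placeholders.

```python
def get_executor_cls_for_tpu(distributed_executor_backend):
    if distributed_executor_backend == "ray":
        return "RayTPUExecutor"   # placeholder name for the Ray-based TPU executor
    # Any other explicit value (e.g. "mp") reaches this bare assert, which is the
    # opaque failure the new config check is meant to replace with a clear error.
    assert distributed_executor_backend is None
    return "TPUExecutor"          # placeholder name for the single-process executor

print(get_executor_cls_for_tpu("ray"))   # RayTPUExecutor
print(get_executor_cls_for_tpu(None))    # TPUExecutor
# get_executor_cls_for_tpu("mp")         # AssertionError, as in the traceback above
```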

Member

I'm confused by "Actually, the MP backend would also work for TPUs".

So the MP backend for TPU is actually not implemented yet?

WoosukKwon (Collaborator, Author)

Yes. We don't have an executor for TPU + MP.

Member

cc @njhill in case there is any misunderstanding: it is because we currently only have the Ray backend supported on TPU.

WoosukKwon merged commit b71c956 into main on Sep 12, 2024
28 of 29 checks passed
WoosukKwon deleted the tpu-ray branch on September 12, 2024 at 03:31
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025