
test: Limiting multi-gpu tests to use Ray as distributed_executor_backend #47

Merged
oandreeva-nv merged 9 commits into main from oandreeva_vllm_0.5.2 on Jul 25, 2024
Conversation

@oandreeva-nv
Contributor

@oandreeva-nv oandreeva-nv commented Jul 17, 2024

In PR #5230 vLLM changed the default executor for distributed serving from Ray to Python-native multiprocessing for single-node processing. This became an issue for Triton starting with the v0.5.1 release.
In Python-native multiprocessing mode with the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and the pt_main_thread processes are never stopped/killed. I'll create an issue a bit later.

Solution: support only Ray for deploying models with tensor_parallel_size > 1, via the "distributed_executor_backend" flag, until the issue is fixed.

This PR adjusts our multi-GPU tests according to the above observations.
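As a sketch of the resulting configuration, a multi-GPU test's model.json would carry the flag alongside its existing vLLM engine arguments (the model name and parallelism degree below are illustrative; only "distributed_executor_backend" is the field this PR adds):

```json
{
    "model": "facebook/opt-125m",
    "tensor_parallel_size": 2,
    "distributed_executor_backend": "ray"
}
```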

@oandreeva-nv oandreeva-nv marked this pull request as ready for review July 17, 2024 23:13
@oandreeva-nv oandreeva-nv changed the title Oandreeva vllm 0.5.2 Limiting multi-gpu tests to use Ray as distributed_executor_backend Jul 17, 2024
@rmccorm4 rmccorm4 changed the title Limiting multi-gpu tests to use Ray as distributed_executor_backend test: Limiting multi-gpu tests to use Ray as distributed_executor_backend Jul 17, 2024
```diff
     "max_lora_rank": 32,
-    "lora_extra_vocab_size": 256
+    "lora_extra_vocab_size": 256,
+    "distributed_executor_backend": "ray"
```
Contributor

@rmccorm4 rmccorm4 Jul 18, 2024


For python native multiprocessing mode and KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired.

A few questions on this for my own understanding moving forward:

  1. Do we know more details or limitations on why this is happening?
  2. Is this an error happening on server shutdown?
    • Is there some issue with the python native multiprocessing due to the details of Triton's python backend launching each instance as a separate process?
  3. Is this with 1, 2, or any amount of model instances with KIND_MODEL?

Contributor Author


Do we know more details or limitations on why this is happening?

This is the issue behind the unclear multi-GPU test failures when upgrading to 0.5.0 versions and up:
In PR #5230 vLLM changed the default executor for distributed serving from Ray to Python-native multiprocessing for single-node processing. This became an issue starting with the v0.5.1 release.
In Python-native multiprocessing mode with the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and the pt_main_thread processes are never stopped/killed.
Solution: add "distributed_executor_backend": "ray" to model.json

Is this an error happening on server shutdown?

Yes, and I have a reproducer outside of Triton.

Is this with 1, 2, or any amount of model instances with KIND_MODEL?

If the "distributed_executor_backend" field is not specified, then for tp>2 distributed within a single node the MP backend kicks in. However, I've noticed that when "distributed_executor_backend" is specified in model.json, vLLM goes through distributed serving even when tp=1. More on the Slack channel about this behavior.

@rcarrata

@oandreeva-nv AFAIK you can set --distributed-executor-backend to ray and avoid the usage of MP.

From the docs of distributed serving:
Multiprocessing will be used by default when not running in a Ray placement group and if there are sufficient GPUs available on the same node for the configured tensor_parallel_size, otherwise Ray will be used. This default can be overridden via the LLM class distributed_executor_backend argument or the --distributed-executor-backend API server argument. Set it to mp for multiprocessing or ray for Ray. Ray is not required to be installed for the multiprocessing case.
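As a sketch of the override path quoted above, the API-server flag can be passed when launching the OpenAI-compatible server (the model name and parallelism degree here are illustrative):

```shell
# Serve a model across 2 GPUs, forcing Ray instead of the default
# Python-native multiprocessing executor
vllm serve facebook/opt-125m \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray
```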

@oandreeva-nv
Contributor Author

@rcarrata That's what I'm doing in this PR: I'm making sure that Ray is used for distributed testing. Or did I misunderstand your comment?

@oandreeva-nv oandreeva-nv merged commit 05c5a8b into main Jul 25, 2024
@oandreeva-nv oandreeva-nv deleted the oandreeva_vllm_0.5.2 branch July 25, 2024 22:32