test: Limiting multi-gpu tests to use Ray as distributed_executor_backend #47
Conversation
| "max_lora_rank": 32, | ||
| "lora_extra_vocab_size": 256 | ||
| "lora_extra_vocab_size": 256, | ||
| "distributed_executor_backend":"ray" |
For python native multiprocessing mode and the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired."
A few questions on this for my own understanding moving forward:
- Do we know more details or limitations on why this is happening?
- Is this an error happening on server shutdown?
- Is there some issue with `python native multiprocessing` due to the details of Triton's python backend launching each instance as a separate process?
- Is this with 1, 2, or any amount of model instances with KIND_MODEL?
Do we know more details or limitations on why this is happening?
This is the issue behind the unclear multi-gpu test failures when upgrading to vLLM versions 0.5.0 and up.

In PR #5230 vllm changed the default executor for distributed serving from Ray to python native multiprocessing for single node processing. This becomes an issue starting with the v0.5.1 release.

For python native multiprocessing mode and the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and `pt_main_thread` processes are never stopped/killed.

Solution: add `"distributed_executor_backend": "ray"` to model.json.
Is this an error happening on server shutdown?
Yes, and I have a reproducer outside of Triton.
Is this with 1, 2, or any amount of model instances with KIND_MODEL?
If "distributed_executor_backend" field is not specified, than for tp>2 and distributed among a single node, than MP backend kicks in. However, I've noticed that even when tp=1 and "distributed_executor_backend" is specified in model.json, vllm will go through distributed serving even when tp=1. More on the slack channel for this behavior
@oandreeva-nv afaik you can set the --distributed-executor-backend to ray and avoid the usage of MP. From the docs of distributed serving:
@rcarrata That's what I'm doing in this PR. I'm making sure that Ray is used for distributed testing. Or did I misunderstand your comment?
In PR #5230 vllm changed the default executor for distributed serving from Ray to `python native multiprocessing` for single node processing. This becomes an issue for Triton starting with the v0.5.1 release.

For `python native multiprocessing` mode and the KIND_MODEL setting, Triton hits "failed to stop server: Internal - Exit timeout expired. Exiting immediately." and `pt_main_thread` processes are never stopped/killed. I'll create an issue a bit later.

Solution: support only Ray for deploying models with tensor_parallel_size > 1 via the "distributed_executor_backend" flag until the issue is fixed.
This PR adjusts our multi-gpu tests according to the above observations.
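The adjustment could be sketched as a small helper that patches a `model.json` before the test launches the server. This is not the PR's actual test code; the helper name and file handling are assumptions:

```python
import json
from pathlib import Path

def force_ray_backend(model_json_path: str) -> dict:
    """Ensure multi-GPU configs use Ray as the distributed executor.

    Illustrative helper: adds "distributed_executor_backend": "ray" to a
    vLLM model.json whenever tensor_parallel_size > 1, working around the
    MP-backend shutdown hang ("Exit timeout expired") seen since v0.5.1.
    Configs with tp == 1 are left untouched.
    """
    path = Path(model_json_path)
    config = json.loads(path.read_text())
    if config.get("tensor_parallel_size", 1) > 1:
        config["distributed_executor_backend"] = "ray"
        path.write_text(json.dumps(config, indent=4))
    return config
```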