[Feature] ray serve + model parallel deepspeed inference #22811
Labels
- enhancement: Request for new feature and/or capability
- P2: Important issue, but not time-critical
- serve: Ray Serve related issue
Description
Hi folks,
I'm trying to split a model across multiple GPUs within Ray Serve using DeepSpeed inference.
I believe the problem boils down to the new Ray worker processes not being started by the DeepSpeed launcher.
Could whatever the launcher sets up somehow be forwarded to child processes spawned through Ray?
Ideally, for maximum scalability, this should work both on a single node with multiple GPUs and across multiple nodes for very large models. Some ideas for how this UX could be implemented:
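One possibility is to forward the environment variables the launcher would normally export. This is only a sketch of the idea, not a confirmed solution: the variable names below (MASTER_ADDR, MASTER_PORT, WORLD_SIZE) are the ones torch.distributed expects, the values are illustrative, and per-process RANK / LOCAL_RANK assignment is exactly the piece Ray would still need to coordinate.

```python
# Sketch: forward the distributed-setup environment that the deepspeed
# launcher would normally export to every process Ray spawns, using
# Ray's runtime_env mechanism. Values here are illustrative only.
import ray

ray.init(
    runtime_env={
        "env_vars": {
            "MASTER_ADDR": "127.0.0.1",  # rendezvous address for torch.distributed
            "MASTER_PORT": "29500",      # default torch.distributed port
            "WORLD_SIZE": "2",           # total number of model-parallel ranks
            # RANK and LOCAL_RANK must differ per process; assigning them
            # per worker is the part Ray itself would have to manage.
        }
    }
)
```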
Some code to demonstrate the use case:

```python
import time
from typing import List, Dict, Any
```
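Filling in what that snippet was heading toward, the desired UX might look roughly like the following. This is a sketch of the intent, not working code: the model name, mp_size, GPU count, and generation arguments are illustrative choices of mine, and the deepspeed.init_inference call with mp_size > 1 is exactly what currently fails, because the replica process was not started by the DeepSpeed launcher.

```python
import torch
import deepspeed
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer

@serve.deployment(ray_actor_options={"num_gpus": 2})
class DeepSpeedModel:
    def __init__(self, model_name: str = "gpt2", mp_size: int = 2):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        # Split the model across the GPUs assigned to this replica.
        # With mp_size > 1 this expects launcher-provided process groups,
        # which Ray does not set up today (the gap this issue is about).
        self.engine = deepspeed.init_inference(
            model, mp_size=mp_size, dtype=torch.half
        )

    async def __call__(self, request) -> str:
        prompt = (await request.json())["prompt"]
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        output_ids = self.engine.module.generate(
            inputs.input_ids, max_new_tokens=32
        )
        return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

serve.run(DeepSpeedModel.bind())
```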
Use case
Multi-GPU model parallelism helps speed up APIs backed by large neural networks, and makes it possible to serve models that do not fit on a single GPU.
Examples: large transformer language models whose weights exceed a single GPU's memory.
Related issues
No response
Are you willing to submit a PR?