[Feature] ray serve + model parallel deepspeed inference #22811

Open
1 of 2 tasks
EricSteinberger opened this issue Mar 3, 2022 · 3 comments
Labels
enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical) · serve (Ray Serve Related Issue)

Comments


EricSteinberger commented Mar 3, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Hi folks,

I'm trying to split a model across multiple GPUs within Ray Serve using DeepSpeed inference.

I believe the problem boils down to the new Ray processes not being started by the DeepSpeed launcher.

Could whatever the launcher sets up somehow be forwarded to the child processes spawned through Ray?
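
For context, my rough (possibly wrong) understanding of what the launcher does: it spawns one process per GPU and exports the usual torch.distributed environment variables for each of them, roughly like the hypothetical helper below. None of this happens for the worker processes Ray starts, which is what the feature would need to cover.

import os

# Hypothetical illustration only -- approximates the per-process environment the
# deepspeed launcher appears to set up (standard torch.distributed conventions).
def fake_deepspeed_launcher_env(rank: int, world_size: int,
                                master_addr: str = "127.0.0.1",
                                master_port: int = 29500) -> None:
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)  # assumes a single node
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)

If Ray Serve could set something equivalent inside each spawned worker (and start one worker per GPU) before deepspeed.init_inference runs, I think that would cover most of this request.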

Ideally, for maximum scalability, this should work both on a single node with multiple GPUs and across multiple nodes for very large models. Some ideas for how the UX could look:

  1. serve.start(deepspeed_launcher=True)
  2. a new deepspeedray launcher that starts the serve script via DeepSpeed's launcher and propagates its setup to the Ray worker processes.

Some code to demonstrate the use case
import time
from typing import List, Dict, Any

import deepspeed
import torch
from ray import serve
from torch import nn

N_GPUS = 2
###################################################
# -- The snippet below works if launched with 'deepspeed serve_multigpu_deepspeed_min_repro.py'
#    but NOT with 'python3 serve_multigpu_deepspeed_min_repro.py'.

# nn_torch = nn.Linear(5, 5)
# nn_ds = deepspeed.init_inference(
#     model=nn_torch,
#     mp_size=N_GPUS,
#     dtype=torch.float16,
#     replace_method=False,
#     replace_with_kernel_inject=True,
# )
# exit()


###################################################
# -- This doesn't work with either launcher. Maybe because whatever the deepspeed
#    launcher sets up is not propagated to the new processes?
@serve.deployment(
    name="DeepspeedNetMPNet",
    _autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 1,
    },
    ray_actor_options={"num_gpus": N_GPUS, "num_cpus": 1},
    version="0.0.1",
)
class Net:
    def __init__(self):
        self.nn_torch = nn.Linear(64, 64)
        self.nn_ds = deepspeed.init_inference(
            model=self.nn_torch,
            mp_size=N_GPUS,
            dtype=torch.float16,
            replace_method=False,
            replace_with_kernel_inject=True,
        )

    # @serve.batch(max_batch_size=4)
    async def __call__(self,
                       requests_batch: List[torch.Tensor],
                       ) -> List[Dict[str, Any]]:
        with torch.no_grad():
            return self.nn_ds(torch.stack(requests_batch))


serve.start()
Net.deploy()

# Serve will be shut down once the script exits, so keep it alive manually.
while True:
    time.sleep(5)
    print("Deployments:")
    for k, v in serve.list_deployments().items():
        print(f"{k} | {v}\n")

Use case

Multi-GPU model parallelization helps speed up large-neural-net APIs.

Examples:

  1. Single node, multiple GPUs for large neural nets
  2. Very large neural nets across multiple nodes with multiple GPUs each
  3. Speeding up inference for latency-critical API tasks with small and medium-sized NNs

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@EricSteinberger EricSteinberger added the enhancement Request for new feature and/or capability label Mar 3, 2022
@shrekris-anyscale shrekris-anyscale added serve Ray Serve Related Issue P2 Important issue, but not time-critical labels Mar 4, 2022
@shrekris-anyscale shrekris-anyscale added this to the Serve backlog milestone Mar 4, 2022
@edoakes edoakes removed the platform label Apr 25, 2022
@jiaodong
Member

Hi @EricSteinberger, thanks for providing context for the issue. You're testing a workload that we think is critical but haven't invested much in yet, so some edges may be rough. I haven't worked with DeepSpeed before, but after looking into its launcher script, I think you would have an easier time prototyping directly with the Ray Actor APIs, spawning and coordinating the child processes as actors. For advanced communication patterns over NCCL, you might find ray.collective helpful: https://docs.ray.io/en/latest/ray-more-libs/ray-collective.html. From there, you can move on to scaling this distributed inference workload with Ray Serve.
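
To make that concrete, here's a rough, untested sketch of the actor pattern (the GPUWorker class, group name, and tensor shapes are placeholders I made up; the collective calls follow the ray.util.collective docs linked above):

import ray
import torch
import ray.util.collective as col

N_GPUS = 2

ray.init()

@ray.remote(num_gpus=1)
class GPUWorker:
    # One actor per GPU; together the actors form a model-parallel group.
    def setup(self, world_size: int, rank: int):
        # Join a NCCL collective group so the shards can communicate.
        col.init_collective_group(world_size, rank, backend="nccl", group_name="mp_group")
        return True

    def allreduce_demo(self):
        # Stand-in for the cross-shard communication a model-parallel forward
        # pass would do (e.g. reducing partial activations across GPUs).
        t = torch.ones(4, device="cuda")
        col.allreduce(t, group_name="mp_group")
        return t.cpu()

workers = [GPUWorker.remote() for _ in range(N_GPUS)]
ray.get([w.setup.remote(N_GPUS, rank) for rank, w in enumerate(workers)])
print(ray.get([w.allreduce_demo.remote() for w in workers]))

A Serve deployment could then hold handles to a pool of such actors and fan requests out to them; the DeepSpeed-specific initialization would go where the demo allreduce is.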

We need to understand your workload better first to be more effective -- I've sent you a message on Ray Slack.

@EricSteinberger
Author

Hi! I replied on Slack with more info about our use case. Thank you for looking into the issue!

@jamjambles

Hey @jiaodong,

I am also looking to use DeepSpeed + Ray to run inference with model parallelism.

Has the recommendation above changed? I would very much appreciate it if you could point me to the latest documentation for these concepts.

Thanks!
