[Feature] ray serve + model parallel deepspeed inference #22811

Open
1 of 2 tasks
EricSteinberger opened this issue Mar 3, 2022 · 3 comments
Labels
enhancement (Request for new feature and/or capability) · P2 (Important issue, but not time-critical) · serve (Ray Serve Related Issue)

Comments


EricSteinberger commented Mar 3, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Hi folks,

I'm trying to split a model across multiple GPUs within Ray Serve using DeepSpeed inference.

I believe the problem boils down to the new Ray processes not being started by the DeepSpeed launcher.

Could whatever the launcher sets up somehow be forwarded to the child processes spawned through Ray?
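
For context, my rough (possibly wrong) understanding of what the launcher does: it spawns one process per GPU and exports the usual torch.distributed environment variables for each of them, roughly like the hypothetical helper below. None of this happens for the worker processes Ray starts, which is what the feature would need to cover.

import os

# Hypothetical illustration only -- approximates the per-process environment the
# deepspeed launcher appears to set up (standard torch.distributed conventions).
def fake_deepspeed_launcher_env(rank: int, world_size: int,
                                master_addr: str = "127.0.0.1",
                                master_port: int = 29500) -> None:
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)  # assumes a single node
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)

If Ray Serve could set something equivalent inside each spawned worker (and start one worker per GPU) before deepspeed.init_inference runs, I think that would cover most of this request.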

Ideally, for maximum scalability, this should work both on a single node with multiple GPUs and across multiple nodes for very large models. Some ideas for how the UX could look:

  1. serve.start(deepspeed_launcher=True)
  2. a new deepspeedray launcher that starts the serve script via DeepSpeed's launcher and propagates its setup to the Ray worker processes.

Some code to demonstrate the use case
import time
from typing import List, Dict, Any

import deepspeed
import torch
from ray import serve
from torch import nn

N_GPUS = 2
###################################################
# -- The snippet below works if launched with 'deepspeed serve_multigpu_deepspeed_min_repro.py'
#    but NOT with 'python3 serve_multigpu_deepspeed_min_repro.py'.

# nn_torch = nn.Linear(5, 5)
# nn_ds = deepspeed.init_inference(
#     model=nn_torch,
#     mp_size=N_GPUS,
#     dtype=torch.float16,
#     replace_method=False,
#     replace_with_kernel_inject=True,
# )
# exit()


###################################################
# -- This doesn't work with either launcher. Maybe because whatever the deepspeed
#    launcher sets up is not propagated to the new processes?
@serve.deployment(
    name="DeepspeedNetMPNet",
    _autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 1,
    },
    ray_actor_options={"num_gpus": N_GPUS, "num_cpus": 1},
    version="0.0.1",
)
class Net:
    def __init__(self):
        self.nn_torch = nn.Linear(64, 64)
        self.nn_ds = deepspeed.init_inference(
            model=self.nn_torch,
            mp_size=N_GPUS,
            dtype=torch.float16,
            replace_method=False,
            replace_with_kernel_inject=True,
        )

    # @serve.batch(max_batch_size=4)
    async def __call__(self,
                       requests_batch: List[torch.Tensor],
                       ) -> List[Dict[str, Any]]:
        with torch.no_grad():
            return self.nn_ds(torch.stack(requests_batch))


serve.start()
Net.deploy()

# Serve will be shut down once the script exits, so keep it alive manually.
while True:
    time.sleep(5)
    print("Deployments:")
    for k, v in serve.list_deployments().items():
        print(f"{k} | {v}\n")

Use case

Multi-GPU model parallelization helps speed up large-neural-net APIs.

Examples:

  1. Single node, multiple GPUs for large neural nets
  2. Very large neural nets across multiple nodes with multiple GPUs each
  3. Speeding up inference for latency-critical API tasks with small and medium-sized NNs

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@EricSteinberger EricSteinberger added the enhancement Request for new feature and/or capability label Mar 3, 2022
@shrekris-anyscale shrekris-anyscale added serve Ray Serve Related Issue P2 Important issue, but not time-critical labels Mar 4, 2022
@shrekris-anyscale shrekris-anyscale added this to the Serve backlog milestone Mar 4, 2022
@edoakes edoakes removed the platform label Apr 25, 2022
@jiaodong
Member

Hi @EricSteinberger, thanks for providing context for the issue. You're testing a workload that we think is critical but haven't invested much in yet, so some edges may be rough. I haven't worked with DeepSpeed before, but after looking into its launcher script, I think you would have an easier time prototyping directly with the Ray Actor APIs, spawning and coordinating the child processes as actors. For advanced communication patterns over NCCL, you might find ray.collective helpful: https://docs.ray.io/en/latest/ray-more-libs/ray-collective.html. From there, you can move on to scaling this distributed inference workload with Ray Serve.
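
To make that concrete, here's a rough, untested sketch of the actor pattern (the GPUWorker class, group name, and tensor shapes are placeholders I made up; the collective calls follow the ray.util.collective docs linked above):

import ray
import torch
import ray.util.collective as col

N_GPUS = 2

ray.init()

@ray.remote(num_gpus=1)
class GPUWorker:
    # One actor per GPU; together the actors form a model-parallel group.
    def setup(self, world_size: int, rank: int):
        # Join a NCCL collective group so the shards can communicate.
        col.init_collective_group(world_size, rank, backend="nccl", group_name="mp_group")
        return True

    def allreduce_demo(self):
        # Stand-in for the cross-shard communication a model-parallel forward
        # pass would do (e.g. reducing partial activations across GPUs).
        t = torch.ones(4, device="cuda")
        col.allreduce(t, group_name="mp_group")
        return t.cpu()

workers = [GPUWorker.remote() for _ in range(N_GPUS)]
ray.get([w.setup.remote(N_GPUS, rank) for rank, w in enumerate(workers)])
print(ray.get([w.allreduce_demo.remote() for w in workers]))

A Serve deployment could then hold handles to a pool of such actors and fan requests out to them; the DeepSpeed-specific initialization would go where the demo allreduce is.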

We need to understand your workload better first to be more effective -- I've sent you a message on Ray Slack.

@EricSteinberger
Author

Hi! I replied on Slack with more info about our use case. Thank you for looking into the issue!

@jamjambles

Hey @jiaodong,

I am also looking to use DeepSpeed + Ray to run inference with model parallelism.

Has the recommendation above changed? I would very much appreciate it if you could point me to the latest documentation for these concepts.

Thanks!
