[RFC] LLM APIs for Ray Data and Ray Serve #50639
Comments
This is a great RFC. I want to check if
This looks excellent. When development begins I’d like to lend a hand.
@lizzzcai And for multiplexing LoRA adapters, that's indeed the plan. You can share the base model across all of them and only swap out the adapter weights when a new request comes in. Furthermore, part of the scope is to have multiple base model support as well. Say you have 8 GPUs: you can serve llama-3-8b on 1 GPU, qwen-8b on 1 GPU, qwen-32B on 2 GPUs with tp=2, and llama-70b on 4 GPUs with tp=4. Each base model can have multi-LoRA support, so you can serve an arbitrary number of their fine-tuned variants with the same resources.
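For illustration, here is a minimal sketch of that kind of multi-base-model packing with the ray.serve.llm API that later landed. The LLMConfig and build_openai_app names are assumed from the merged docs; the model sources, accelerator type, and tensor-parallel sizes are placeholders.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Two base models packed onto the same cluster: a small model on a single GPU
# and a larger model sharded across two GPUs with tensor parallelism.
llama_8b = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    accelerator_type="A10G",
    engine_kwargs=dict(tensor_parallel_size=1),
)
qwen_32b = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-32b",
        model_source="Qwen/Qwen2.5-32B-Instruct",
    ),
    accelerator_type="A10G",
    engine_kwargs=dict(tensor_parallel_size=2),
)

# A single OpenAI-compatible ingress routes requests to both base models;
# LoRA adapters can additionally be multiplexed on top of each of them.
app = build_openai_app({"llm_configs": [llama_8b, qwen_32b]})
serve.run(app)
```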
@lizzzcai @justinrmiller thanks a bunch for your comments -- please let us know if you have other feature requests / use cases we should design for. @justinrmiller we've started development (for example #50494, #50270) -- and would love your help as there's tons to do. Maybe we can connect on Slack?
Very cool @justinrmiller
@Or-Levi - are you interested in online or offline? Gemini should be supported via the HttpRequestProcessor if you're looking for offline.
## Why are these changes needed?
Adds user guide and link-ins for Ray Data documentation. This is part of the #50639 thread of work. This is based on #50494
cc @comaniac @gvspraveen @kouroshHakha
## Related issue number
<!-- For example: "Closes #1234" -->
## Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
---------
Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Yes, offline. Got it.
Hi all, we've just merged an initial version of Ray Data LLM onto master.
Documentation is here: https://docs.ray.io/en/master/data/working-with-llms.html
You can try it out by installing nightly: https://docs.ray.io/en/master/ray-overview/installation.html#daily-releases-nightlies
Please try it out and let us know if you have any feature requests / run into any issues.
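For reference, a minimal sketch of the batch-inference flow the linked documentation describes, assuming the vLLMEngineProcessorConfig and build_llm_processor names from the merged ray.data.llm module; the model, sampling parameters, and column names are placeholders.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Configure one vLLM engine replica; Ray Data handles batching and GPU placement.
config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-0.5B-Instruct",
    engine_kwargs={"max_model_len": 8192},
    concurrency=1,
    batch_size=64,
)

processor = build_llm_processor(
    config,
    # Map each input row to a chat request.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=128),
    ),
    # Keep the generated text alongside the original columns.
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

ds = ray.data.from_items([{"prompt": "What is Ray?"}])
processor(ds).show()
```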
Overall, better vLLM integration would be great! A couple of questions and some comments: GPUs are our main constraint, and models are large and take some time to load. If we have a given model deployed on a (set of) GPU(s), then I'd like that vLLM instance to be maxed out before the cluster tries to allocate more GPUs. Currently we run vLLM externally to Ray and make all calls via a REST API. We've also largely given up on GPU scaling due to cloud provisioning problems, so autoscaling is still more wish than reality.
We implemented the vLLM engine as a UDF of Ray Data's .map_batches(), so 1) the resources that vLLM can use are allocated by Ray Data, and 2) it doesn't launch an API server or go through REST; Ray Data directly creates a vLLM engine object and uses it in place. Pseudo-code:

class UDF:
    def __init__(self, ...):
        # One engine per UDF actor; Ray Data allocates the GPU it runs on.
        self.llm = vllm.engine(...)

    async def __call__(self, batch):
        return await self.llm(batch)

dataset = dataset.map_batches(
    UDF,
    num_gpus=1,     # One GPU per engine.
    concurrency=N,  # Launch N engines.
)
We are able to handle that (not landed at this moment) in the following way:

dataset = dataset.map_batches(
    UDF,
    num_gpus=1,          # One GPU per engine.
    concurrency=(1, N),  # Launch up to N engines, but start processing once one engine is ready.
)
RFC: LLM APIs for Ray Data and Ray Serve
Summary
This RFC proposes new APIs in Ray for leveraging Large Language Models (LLMs) effectively within the Ray ecosystem, specifically introducing integrations for Ray Serve and Ray Data with vLLM and OpenAI.
Motivation
As LLMs become increasingly central to modern AI infrastructure deployments, platforms require the ability to deploy and scale these models efficiently. Ray Data and Ray Serve currently have limited support for LLM deployments: users have to manually configure and manage the underlying LLM engine.
This proposal aims to address these challenges by providing unified, production-ready APIs for both batch processing and serving of LLMs within Ray, in ray.data.llm and ray.serve.llm.
Ray Data LLM
ray.data.llm introduces several key components:
Design Principles:
You can also make calls to deployed models that have an OpenAI compatible API endpoint.
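A hedged sketch of that pattern with an HTTP-based processor, assuming the HttpRequestProcessorConfig name and fields from the merged ray.data.llm docs; the endpoint URL, model name, and API key are placeholders.

```python
import ray
from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

# Send each row as a request to any OpenAI-compatible chat completions endpoint.
config = HttpRequestProcessorConfig(
    url="https://api.openai.com/v1/chat/completions",
    headers={"Authorization": "Bearer <API_KEY>"},
    qps=1,  # Throttle the request rate.
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        payload=dict(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": row["prompt"]}],
        ),
    ),
    postprocess=lambda row: dict(
        answer=row["http_response"]["choices"][0]["message"]["content"],
    ),
)

ds = ray.data.from_items([{"prompt": "Hello!"}])
processor(ds).show()
```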
Ray Serve LLM
The new ray.serve.llm provides:
These features allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API.
Below is a more comprehensive example of using the OpenAI API with Ray Serve.
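The sketch below shows what such a deployment could look like, assuming the LLMConfig and build_openai_app names from the merged ray.serve.llm module; the model, accelerator type, and autoscaling bounds are placeholders.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",                       # Name exposed through the OpenAI API.
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # Weights to load into the engine.
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    accelerator_type="A10G",
    engine_kwargs=dict(max_model_len=8192),
)

# Builds an OpenAI-compatible ingress (e.g. /v1/chat/completions) in front of
# the vLLM-backed deployment and starts it on the local Ray cluster.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```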
And you can now use an OpenAI API client to interact with the deployed models.
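For example, with the standard openai Python client pointed at the Serve ingress; the base URL assumes a local deployment on the default Serve port, and the model name matches the model_id in the config above.

```python
from openai import OpenAI

# The API key is a placeholder; authentication is not handled by this deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="qwen-0.5b",  # The model_id registered in the LLMConfig above.
    messages=[{"role": "user", "content": "Hello from Ray Serve LLM!"}],
)
print(response.choices[0].message.content)
```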
Advanced Example: RAG Pipeline with ray.data.llm
Here is a more complex example that demonstrates how to build a RAG pipeline using the new ray.data.llm APIs.
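A rough sketch of the shape such a pipeline could take, reusing the assumed vLLMEngineProcessorConfig / build_llm_processor API from above; the retrieval step is a stand-in for whatever vector-store lookup the application actually uses.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

def retrieve_context(row):
    # Stand-in retrieval step: a real pipeline would query a vector store for
    # the top-k passages matching row["question"] instead of a canned string.
    row["context"] = "Ray is an open-source framework for scaling Python workloads."
    return row

generator = build_llm_processor(
    vLLMEngineProcessorConfig(
        model_source="Qwen/Qwen2.5-7B-Instruct",
        concurrency=1,
        batch_size=32,
    ),
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{row['context']}\n\nQuestion: {row['question']}"},
        ],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(question=row["question"], answer=row["generated_text"]),
)

ds = ray.data.from_items([{"question": "What is Ray?"}])
ds = ds.map(retrieve_context)  # Retrieval stage.
ds = generator(ds)             # Generation stage over the retrieved context.
ds.show()
```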
Advanced Example: LoRA serving with ray.serve.llm
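A hedged sketch of multi-LoRA serving on a shared base model, assuming a lora_config field on LLMConfig as in the merged docs; the adapter path, adapter count, and model names are placeholders.

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    lora_config=dict(
        # Adapters are pulled on demand from this path and multiplexed onto the
        # shared base model weights.
        dynamic_lora_loading_path="s3://my-bucket/lora-adapters/",
        max_num_adapters_per_replica=16,
    ),
    accelerator_type="A10G",
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)

# Requests can then target a specific adapter with an OpenAI-style model name
# such as "llama-3-8b:my_adapter", or the base model by its plain model_id.
```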
ray.serve.llm config-based API
There is also a configuration-based API for serving LLMs, where the configurations can be declared separately from the application logic.
Sample config.yaml
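A sketch of what such a file could contain, assuming build_openai_app can be used as a Serve application builder from a config file; the model, accelerator, and autoscaling values are placeholders.

```yaml
# config.yaml (illustrative)
applications:
- name: llm_app
  route_prefix: /
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: qwen-0.5b
          model_source: Qwen/Qwen2.5-0.5B-Instruct
        accelerator_type: A10G
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
```

The application can then be started with `serve run config.yaml` or deployed to a cluster with `serve deploy config.yaml`.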
Future Work
cc @comaniac @kouroshHakha @akshay-anyscale @gvspraveen