lihuahua123/jointserve

Preble

Preble is a load balancer for efficient prefix-caching systems. A preprint is available at https://arxiv.org/abs/2407.00023
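To give a feel for why a load balancer might care about prefix caching, here is a toy sketch of prefix-aware routing. It is not Preble's scheduling algorithm (see the preprint for that). The class `ToyPrefixRouter`, the helper `shared_prefix_len`, and the runtime names are all hypothetical, and characters stand in for tokens.

```python
from os.path import commonprefix
from typing import Dict, List


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings (characters, not tokens)."""
    return len(commonprefix([a, b]))


class ToyPrefixRouter:
    """Hypothetical router: prefer the runtime that has already seen the
    longest prefix of the incoming prompt; break ties by lighter load."""

    def __init__(self, runtimes: List[str]) -> None:
        # Prompts previously routed to each runtime (a stand-in for its cache).
        self.history: Dict[str, List[str]] = {r: [] for r in runtimes}

    def route(self, prompt: str) -> str:
        def score(runtime: str):
            seen = self.history[runtime]
            best = max((shared_prefix_len(prompt, p) for p in seen), default=0)
            # Longer shared prefix wins; fewer routed prompts breaks ties.
            return (best, -len(seen))

        chosen = max(self.history, key=score)
        self.history[chosen].append(prompt)
        return chosen


router = ToyPrefixRouter(["runtime-0", "runtime-1"])
first = router.route("System: you are helpful. Question 1")
second = router.route("An unrelated prompt")
third = router.route("System: you are helpful. Question 2")
```

In this sketch the third prompt lands on the same runtime as the first because they share a long prefix, while the unrelated prompt goes to the idle runtime.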

Installation

You can install the package using pip; the commands are listed under Editable Installation and Regular Pip Installation below.

Code Structure

The multi_node directory contains the code that runs as a separate abstraction layer on top of SGLang/vLLM in a distributed setting. It coordinates and manages execution across the distributed system.

Editable Installation

pip3 install -e .
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

Regular Pip Installation:

pip3 install preble
pip install git+https://github.com/wuklab/preble.git#egg=preble[all]
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/

We release a custom version of sglang that supports chunked prefill.

Programmatically starting the server

The server accepts a comma-separated list of runtime URLs:

from preble.main import start_server

start_server(
    runtime_selection_policy="custom",
    runtime_urls="http://127.0.0.1:30000/generate,http://127.0.0.1:30001/generate",
    host='127.0.0.1',
    port=8000,
    model="mistralai/Mistral-7B-v0.1"
)

It can also load the model dynamically onto separate CUDA devices:

from preble.main import start_server_and_load_models

start_server_and_load_models(
    model_name="mistralai/Mistral-7B-v0.1",
    devices=[0, 1],
    host="127.0.0.1",
    port=8000
)

The server can be run via:

python3 multi_node/server/server.py <server/deploy_and_run>
  • server: runs the server given a list of URLs
  • deploy_and_run: generates two endpoints

CLI Configuration

    runtime_selection_policy: The policy to select the runtime (e.g., custom, round_robin).
    runtime_urls: Comma-separated list of runtime URLs.
    host: The host address for the server.
    port: The port number for the server.
    model: The model to be used (e.g., mistralai/Mistral-7B-v0.1).
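As a rough illustration of how the `runtime_urls` string and a `round_robin` policy fit together, the sketch below splits the comma-separated value and cycles through the resulting endpoints. `parse_runtime_urls` and `RoundRobinSelector` are hypothetical names for this example, not part of Preble's API, and the actual policy implementation may differ.

```python
from itertools import cycle
from typing import Iterator, List


def parse_runtime_urls(runtime_urls: str) -> List[str]:
    """Split a comma-separated URL string, trimming whitespace and dropping blanks."""
    return [url.strip() for url in runtime_urls.split(",") if url.strip()]


class RoundRobinSelector:
    """Hypothetical selector: hands out runtimes in a fixed rotating order."""

    def __init__(self, urls: List[str]) -> None:
        if not urls:
            raise ValueError("at least one runtime URL is required")
        self._urls: Iterator[str] = cycle(urls)

    def select(self) -> str:
        # Each call returns the next URL, wrapping around at the end.
        return next(self._urls)


urls = parse_runtime_urls(
    "http://127.0.0.1:30000/generate,http://127.0.0.1:30001/generate"
)
selector = RoundRobinSelector(urls)
picks = [selector.select() for _ in range(3)]
```

Round robin ignores request content entirely, which makes it a useful baseline to compare against a prefix-aware policy.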

Citation And Acknowledgment

The code is forked from sglang.

PyPI build and install instructions

Currently uploaded to TestPyPI:

python setup.py bdist_wheel
twine upload --repository testpypi dist/* --verbose
python3 -m pip install --index-url https://test.pypi.org/simple/ preble

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
