Support routing to multiple backends #7

Open
mostlygeek opened this issue Oct 30, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments

mostlygeek (Owner) commented Oct 30, 2024

For multi-GPU machines, llama-swap should be able to load multiple inference backends and route requests to them appropriately. For example:

                               +----------+              
         +-------------------> | qwen-72b |              
         |                     +----------+              
         |                        running on GPU #1,#2,#3
+--------+---+                                           
| llama-swap |                                           
+--------+---+                                           
         |                                               
         |                     +---------------+         
         +-------------------> | qwen-coder-7b |         
                               +---------------+         
                                  running on GPU #4      

The use case is to have more control over, and better utilization of, local resources. Additionally, this would work well for software development, where a larger model can be used for chat and a smaller, faster model for auto-complete.

With NVIDIA GPUs (on Linux) this can be done by setting CUDA_VISIBLE_DEVICES in each backend's environment.
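
A minimal sketch of what per-model GPU pinning could look like, reusing the models/cmd/proxy config shape discussed later in this thread. The model files, ports, GPU indices, and the env-prefix form (which assumes the command is launched through a shell) are all illustrative, not a final design:

models:
  "qwen-72b":
    # illustrative: pin this backend to GPUs 1,2,3
    cmd: CUDA_VISIBLE_DEVICES=1,2,3 llama-server --port 9001 -m models/qwen-72b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-7b":
    # illustrative: pin this backend to GPU 4
    cmd: CUDA_VISIBLE_DEVICES=4 llama-server --port 9002 -m models/qwen-coder-7b.gguf
    proxy: http://127.0.0.1:9002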

@mostlygeek
Copy link
Owner Author

mostlygeek commented Nov 21, 2024

To avoid introducing a lot more complexity, it is desirable to keep using the API paths /v1/chat/completions and /v1/completions. The current design is therefore to encode the group into the model name, like group/model, e.g. coding/qwen-coder-32b.

The configuration file could look like this:

# Seconds to wait for llama.cpp to be available to serve requests
# Default (and minimum): 15 seconds
healthCheckTimeout: 15

models:
  "qwen-coder-1b":
    cmd: llama-server --port 9001 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-32b":
    cmd: llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002

groups:
  coding:
    - "qwen-coder-1b"
    - "qwen-coder-32b"

When a request comes in for coding/qwen-coder-1b, llama-swap will make sure the whole coding group is loaded, so a later call to coding/qwen-coder-32b will be routed to an already running server. However, a request for plain qwen-coder-1b will unload the whole group and load only that single model.
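
As a sketch of how a client would use this, the group prefix simply goes in the model field of a normal OpenAI-style request. The listen address (localhost:8080) and the prompt are illustrative assumptions:

# Loads the whole "coding" group (if not already running), then routes to qwen-coder-1b
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-1b", "messages": [{"role": "user", "content": "hello"}]}'

# Routed to the already-running qwen-coder-32b server; no swap needed
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-32b", "messages": [{"role": "user", "content": "hello"}]}'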

There are some tradeoffs between configuration and complexity with this approach:

  • it is simple to manage and understand
  • the listening addresses (ports) of models in the same group must not conflict, since they run concurrently
  • there is no per-group customization of a model's settings; a new model definition would need to be created instead

mostlygeek added a commit that referenced this issue Nov 24, 2024
Refactor code to support starting multiple backend llama.cpp servers. This functionality is exposed as `profiles` to provide a simple configuration format.

Changes: 

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config