Support routing to multiple backends #7

Open
mostlygeek opened this issue Oct 30, 2024 · 1 comment
Labels: enhancement (New feature or request)

Comments

mostlygeek (Owner) commented Oct 30, 2024

For multi-GPU machines, llama-swap should be able to load multiple inference backends and route requests to them appropriately. For example:

                               +----------+              
         +-------------------> | qwen-72b |              
         |                     +----------+              
         |                        running on GPU #1,#2,#3
+--------+---+                                           
| llama-swap |                                           
+--------+---+                                           
         |                                               
         |                     +---------------+         
         +-------------------> | qwen-coder-7b |         
                               +---------------+         
                                  running on GPU #4      

The use case is to have more control over, and better utilization of, local resources. Additionally, this would work well for software development, where a larger model can be used for chat and a smaller, faster model for auto-complete.

With NVIDIA GPUs (on Linux) this can be done by setting CUDA_VISIBLE_DEVICES in each backend's environment.
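
A minimal sketch of what per-model GPU pinning could look like, reusing the models/cmd/proxy config shape discussed later in this thread. The model files, ports, GPU indices, and the env-prefix form (which assumes the command is launched through a shell) are all illustrative, not a final design:

models:
  "qwen-72b":
    # illustrative: pin this backend to GPUs 1,2,3
    cmd: CUDA_VISIBLE_DEVICES=1,2,3 llama-server --port 9001 -m models/qwen-72b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-7b":
    # illustrative: pin this backend to GPU 4
    cmd: CUDA_VISIBLE_DEVICES=4 llama-server --port 9002 -m models/qwen-coder-7b.gguf
    proxy: http://127.0.0.1:9002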

@mostlygeek
Copy link
Owner Author

mostlygeek commented Nov 21, 2024

To avoid introducing a lot more complexity, it is desirable to keep using the API paths /v1/chat/completions and /v1/completions. The current design is therefore to encode the group into the model name, like group/model, e.g. coding/qwen-coder-32b.

The configuration file could look like this:

# Seconds to wait for llama.cpp to be available to serve requests
# Default (and minimum): 15 seconds
healthCheckTimeout: 15

models:
  "qwen-coder-1b":
    cmd: llama-server --port 9001 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-32b":
    cmd: llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002

groups:
  coding:
    - "qwen-coder-1b"
    - "qwen-coder-32b"

When a request comes in for coding/qwen-coder-1b, llama-swap will make sure the whole coding group is loaded, so a later call to coding/qwen-coder-32b will be routed to an already running server. However, a request for plain qwen-coder-1b will unload the whole group and load only that single model.
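
As a sketch of how a client would use this, the group prefix simply goes in the model field of a normal OpenAI-style request. The listen address (localhost:8080) and the prompt are illustrative assumptions:

# Loads the whole "coding" group (if not already running), then routes to qwen-coder-1b
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-1b", "messages": [{"role": "user", "content": "hello"}]}'

# Routed to the already-running qwen-coder-32b server; no swap needed
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen-coder-32b", "messages": [{"role": "user", "content": "hello"}]}'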

There are some tradeoffs between configuration and complexity with this approach:

  • it is simple to manage and understand
  • the listening addresses (ports) of models in the same group must not conflict, since they run concurrently
  • there is no per-group customization of a model's settings; a new model definition would need to be created instead

mostlygeek added a commit that referenced this issue Nov 24, 2024
Refactor code to support starting multiple backend llama.cpp servers. This functionality is exposed as `profiles` to provide a simple configuration format.

Changes: 

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config