Support routing to multiple backends #7
To not introduce a lot more complexity, it is desirable to keep using the API paths. The configuration file can be like:

```yaml
# Seconds to wait for llama.cpp to be available to serve requests
# Default (and minimum): 15 seconds
healthCheckTimeout: 15

models:
  "qwen-coder-1b":
    cmd: llama-server --port 9001 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-32b":
    cmd: llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002

groups:
  coding:
    - "qwen-coder-1b"
    - "qwen-coder-32b"
```

When calling with …

There are some tradeoffs between configuration and complexity with this approach:
Refactor code to support starting of multiple back end llama.cpp servers. This functionality is exposed as `profiles` to create a simple configuration format.

Changes:

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config
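For reference, a minimal sketch of the earlier config after the groups-to-profiles rename mentioned in the last change item; the key name and the way a profile member would be referenced in a request are assumptions based on that change list, not quoted from the final docs:

```yaml
# Hypothetical post-rename config: "groups" becomes "profiles".
# A request would then pick a member of the profile, e.g. something like
# "coding:qwen-coder-1b" (the naming scheme here is an assumption).
profiles:
  coding:
    - "qwen-coder-1b"
    - "qwen-coder-32b"
```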
For multi-gpu machines, be able to load multiple inference backends and route to them appropriately. For example:
The use case is to have more control and better utilization of local resources. Additionally, this would work well for software development where a larger model can be used for chat and a smaller, faster model for auto-complete.
With NVIDIA GPUs (on Linux) this can be done using `CUDA_VISIBLE_DEVICES` in the environment.
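To make the multi-GPU idea concrete, here is a minimal sketch of a config where each backend is pinned to its own GPU via `CUDA_VISIBLE_DEVICES`. The env-var prefix in `cmd` assumes the command is launched through a shell (or that the config offers some other way to set environment variables); treat it as an illustration rather than confirmed syntax:

```yaml
models:
  # Hypothetical: small model for auto-complete pinned to GPU 0
  "qwen-coder-1b":
    cmd: CUDA_VISIBLE_DEVICES=0 llama-server --port 9001 -m models/qwen-coder-0.5b.gguf
    proxy: http://127.0.0.1:9001

  # Hypothetical: large model for chat pinned to GPU 1
  "qwen-coder-32b":
    cmd: CUDA_VISIBLE_DEVICES=1 llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002
```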