Implement Multi-Process Handling (#7) #15
Merged
This change adds functionality for llama-swap to manage multiple llama-server backends at once. This is an advanced use case that requires enough GPU resources to hold multiple models in memory. A typical example is programming, where a larger model like qwen-coder-32B is used for code generation while qwen-coder-1.5B handles auto-completion.
This implements the design from issue #7. A new `groups` configuration is introduced (sketched below). When a request to `/v1/chat/completions` with `{"model": "coding/qwen-coder-1.5b"}` is sent, llama-swap loads both models in the group onto the GPU(s). A later request for `"coding/qwen-coder-32b"` is then served immediately. This can greatly reduce the time to first token for requests to either model in the configuration.
It is up to the operator to ensure the group configuration is viable. Generally, watch out for these things:

- `port` conflicts in the configurations