Implement Multi-Process Handling (#7) #15

Merged: 7 commits into main from multi-process, Nov 24, 2024
Conversation

@mostlygeek (Owner) commented on Nov 22, 2024

This change adds functionality for llama-swap to manage multiple llama-server backends. This is an advanced use case that requires enough GPU resources to load multiple models at once. For example, when programming, a larger model like qwen-coder-32B can be used for code generation while qwen-coder-1.5B handles auto-completion.

This implements the design from issue #7. A new groups configuration is introduced:

models:
  "qwen-coder-1.5b":
    cmd: llama-server --port 9001 -m models/qwen-coder-1.5b.gguf
    proxy: http://127.0.0.1:9001

  "qwen-coder-32b":
    cmd: llama-server --port 9002 -m models/qwen-coder-32B.gguf
    proxy: http://127.0.0.1:9002

groups:
  coding:
    - "qwen-coder-1.5b"
    - "qwen-coder-32b"

When a request to /v1/chat/completions with {"model":"coding/qwen-coder-1.5b"} is sent, llama-swap loads both models in the group into the GPU(s). A subsequent request for "coding/qwen-coder-32b" is then served immediately, since it is already loaded. This can greatly reduce the time to first token for every model in the group.
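
For illustration, a request against the group might look like the following (localhost:8080 is a placeholder for wherever your llama-swap instance is listening):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coding/qwen-coder-1.5b",
        "messages": [{"role": "user", "content": "Complete this function..."}]
      }'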

It is up to the operator to ensure the group configuration is viable. Generally, watch out for these things (see the sketch after this list):

  1. there are sufficient GPU resources to load all models in the group
  2. there are no port conflicts between the model configurations
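
Two quick, illustrative checks; the nvidia-smi call and the config.yaml file name are assumptions about a typical setup, not something llama-swap provides:

# 1. compare free VRAM against the combined size of the models in the group (NVIDIA GPUs)
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv

# 2. list any port numbers that appear more than once in the config
grep -oE 'port [0-9]+' config.yaml | sort | uniq -d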

@mostlygeek added the enhancement (New feature or request) label on Nov 22, 2024
@mostlygeek (Owner, Author) commented:

Renamed groups to profiles as this makes more sense when talking about it.
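
For reference, a minimal sketch of the earlier example under the new name, assuming only the top-level key changed from groups to profiles:

profiles:
  coding:
    - "qwen-coder-1.5b"
    - "qwen-coder-32b"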

@mostlygeek merged commit 73ad85e into main on Nov 24, 2024
@mostlygeek deleted the multi-process branch on November 24, 2024 03:45