FlexLLama is a lightweight, extensible, and user-friendly self-hosted tool that easily runs multiple llama.cpp server instances with OpenAI v1 API compatibility. It's designed to manage multiple models across different GPUs, making it a powerful solution for local AI development and deployment.
- **Multiple llama.cpp instances** - Run different models simultaneously
- **Multi-GPU support** - Distribute models across different GPUs
- **OpenAI v1 API compatible** - Drop-in replacement for OpenAI endpoints
- **Real-time dashboard** - Monitor model status with a web interface
- **Chat & Completions** - Full chat and text completion support
- **Embeddings & Reranking** - Supports models for embeddings and reranking
- **Auto-start** - Automatically start default runners on launch
- **Model switching** - Dynamically load/unload models as needed
Want to get started in 5 minutes? Check out our QUICKSTART.md for a simple Docker setup with the Qwen3-4B model!
1. **Install FlexLLama:**

   From GitHub:

   ```bash
   pip install git+https://github.com/yazon/flexllama.git
   ```

   From local source (after cloning):

   ```bash
   # git clone https://github.com/yazon/flexllama.git
   # cd flexllama
   pip install .
   ```
2. **Create your configuration:** Copy the example configuration file to create your own. If you installed from a local clone, you can run:

   ```bash
   cp config_example.json config.json
   ```

   If you installed from Git, you may need to download `config_example.json` from the repository.
3. **Edit `config.json`:** Update `config.json` with the correct paths for your `llama-server` binary and your model files (`.gguf`).
4. **Run FlexLLama:**

   ```bash
   python main.py config.json
   ```

   or

   ```bash
   flexllama config.json
   ```
5. **Open the dashboard:**

   http://localhost:8080
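Once FlexLLama is running, any OpenAI v1 client can talk to it. Below is a minimal sketch using the official `openai` Python package (assumptions: `pip install openai`, and a model exposed under the alias `my-model` as in the example configuration later in this README):

```python
# Point the standard OpenAI client at the local FlexLLama endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # no API key required locally

response = client.chat.completions.create(
    model="my-model",  # must match a model_alias from your config.json
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```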
FlexLLama can be run using Docker and Docker Compose. We provide profiles for both CPU-only and GPU-accelerated (NVIDIA CUDA) environments.
Clone the repository:

```bash
git clone https://github.com/yazon/flexllama.git
cd flexllama
```
After cloning, you can proceed with the quick start script or a manual setup.
For an easier start, the `docker-start.sh` helper script automates several setup steps. It checks your Docker environment, builds the correct image (CPU or GPU), and provides the commands to launch FlexLLama.
1. **Make the script executable (Linux/Unix):**

   ```bash
   chmod +x docker-start.sh
   ```
2. **Run the script:** Use the `--gpu` flag for NVIDIA GPU support.

   For a CPU-only setup:

   ```bash
   ./docker-start.sh
   ```

   or, on Windows:

   ```powershell
   ./docker-start.ps1
   ```

   For a GPU-accelerated setup:

   ```bash
   ./docker-start.sh --gpu
   ```

   or, on Windows:

   ```powershell
   ./docker-start.ps1 -gpu
   ```
3. **Follow the on-screen instructions:** The script will guide you.
Manual Docker and Docker Compose Setup
If you prefer to run the steps manually, follow this guide:
1. **Place your models:**

   ```bash
   # Create the models directory if it doesn't exist
   mkdir -p models

   # Copy your .gguf model files into it
   cp /path/to/your/model.gguf models/
   ```
2. **Configure your models:** Edit the Docker configuration to point to your models:

   - CPU-only: keep `"n_gpu_layers": 0`
   - GPU: set `"n_gpu_layers"` to e.g. 99 and specify `"main_gpu": 0`
3. **Build and start FlexLLama with Docker Compose (recommended):** Use the `--profile` flag to select your environment. The service will be available at http://localhost:8080.

   For CPU-only:

   ```bash
   docker compose --profile cpu up --build -d
   ```

   For GPU support (NVIDIA CUDA):

   ```bash
   docker compose --profile gpu up --build -d
   ```
4. **View logs:** To monitor the output of your services, you can view their logs in real time.

   For the CPU service:

   ```bash
   docker compose --profile cpu logs -f
   ```

   For the GPU service:

   ```bash
   docker compose --profile gpu logs -f
   ```

   (Press `Ctrl+C` to stop viewing the logs.)
5. **(Alternative) Using `docker run`:** You can also build and run the containers manually.

   For CPU-only:

   ```bash
   # Build the image
   docker build -t flexllama:latest .

   # Run the container
   docker run -d -p 8080:8080 \
     -v $(pwd)/models:/app/models:ro \
     -v $(pwd)/docker/config.json:/app/config.json:ro \
     flexllama:latest
   ```

   For GPU support (NVIDIA CUDA):

   ```bash
   # Build the image
   docker build -f Dockerfile.cuda -t flexllama-gpu:latest .

   # Run the container
   docker run -d --gpus all -p 8080:8080 \
     -v $(pwd)/models:/app/models:ro \
     -v $(pwd)/docker/config.json:/app/config.json:ro \
     flexllama-gpu:latest
   ```
6. **Open the dashboard:** Access the FlexLLama dashboard in your browser:

   http://localhost:8080
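To check the containerized service from a script instead of the browser, here is a small sketch using the `requests` package (an assumption: `pip install requests`; the `/health` path matches the `health_endpoint` shown in the example configuration below):

```python
# Quick sanity check of a running FlexLLama instance.
import requests

base = "http://localhost:8080"
for path in ("/health", "/v1/models"):
    r = requests.get(base + path, timeout=10)
    print(path, r.status_code, r.text[:200])  # status code plus a short response preview
```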
Edit `config.json` to configure your runners and models:

```json
{
    "auto_start_runners": true,
    "api": {
        "host": "0.0.0.0",
        "port": 8080,
        "health_endpoint": "/health"
    },
    "runner1": {
        "type": "llama-server",
        "path": "/path/to/llama-server",
        "host": "127.0.0.1",
        "port": 8085,
        "inherit_env": true,
        "env": {}
    },
    "models": [
        {
            "runner": "runner1",
            "model": "/path/to/model.gguf",
            "model_alias": "my-model",
            "n_ctx": 4096,
            "n_gpu_layers": 99,
            "main_gpu": 0
        }
    ]
}
```

A multi-GPU setup with separate chat and embedding runners looks like this:

```json
{
    "runner_gpu0": {
        "path": "/path/to/llama-server",
        "port": 8085,
        "inherit_env": true,
        "env": {}
    },
    "runner_gpu1": {
        "path": "/path/to/llama-server",
        "port": 8086,
        "inherit_env": true,
        "env": {}
    },
    "models": [
        {
            "runner": "runner_gpu0",
            "model": "/path/to/chat-model.gguf",
            "model_alias": "chat-model",
            "main_gpu": 0,
            "n_gpu_layers": 99
        },
        {
            "runner": "runner_gpu1",
            "model": "/path/to/embedding-model.gguf",
            "model_alias": "embedding-model",
            "embedding": true,
            "main_gpu": 1,
            "n_gpu_layers": 99
        }
    ]
}
```

FlexLLama supports automatic model unloading to free up RAM when models are idle. This is useful for managing memory usage when running multiple models:
```json
{
    "runner_memory_saver": {
        "path": "/path/to/llama-server",
        "port": 8085,
        "auto_unload_timeout_seconds": 300
    },
    "runner_always_on": {
        "path": "/path/to/llama-server",
        "port": 8086,
        "auto_unload_timeout_seconds": 0
    },
    "models": [
        {
            "runner": "runner_memory_saver",
            "model": "/path/to/large-model.gguf",
            "model_alias": "large-model"
        },
        {
            "runner": "runner_always_on",
            "model": "/path/to/small-model.gguf",
            "model_alias": "small-model"
        }
    ]
}
```

Auto-unload Behavior:

- `auto_unload_timeout_seconds: 0` - Disables auto-unload (default)
- `auto_unload_timeout_seconds: 300` - Unloads the model after 5 minutes of inactivity
- Models are considered "active" while processing requests (including streaming)
- The timeout is measured from the last request completion
- Auto-unload frees RAM by stopping the runner process entirely
- Models will be automatically reloaded when the next request arrives
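One rough way to observe this end to end is to time two identical requests around the idle window. This is only a sketch (assumptions: `pip install requests`, plus the `large-model` alias and the 300-second timeout from the example above):

```python
# Time a warm request, wait past the idle timeout, then time a cold request
# that triggers an automatic model reload.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "large-model",  # alias from the auto-unload example above
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}

def timed_request() -> float:
    start = time.time()
    requests.post(URL, json=payload, timeout=600).raise_for_status()
    return time.time() - start

print(f"warm request: {timed_request():.1f}s")
time.sleep(310)  # exceed auto_unload_timeout_seconds (300), so the runner is stopped
print(f"cold request: {timed_request():.1f}s (includes the automatic reload)")
```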
FlexLLama supports setting environment variables for runners and individual models. This is useful for configuring GPU devices, library paths, or other runtime settings.
```json
{
    "runner_vulkan": {
        "type": "llama-server",
        "path": "/path/to/llama-server",
        "port": 8085,
        "inherit_env": true,
        "env": {
            "GGML_VULKAN_DEVICE": "1",
            "RUNNER_SPECIFIC_VAR": "value"
        }
    },
    "models": [
        {
            "runner": "runner_vulkan",
            "model": "/path/to/model.gguf",
            "model_alias": "my-model",
            "env": {
                "MODEL_SPECIFIC_VAR": "override"
            }
        }
    ]
}
```

Runner Options:

- `path`: Path to the llama-server binary
- `host` / `port`: Where to run this instance
- `inherit_env`: Whether to inherit parent environment variables (default: `true`)
- `env`: Dictionary of environment variables to set for all models on this runner
- `extra_args`: Additional arguments for llama-server (applied to all models using this runner)
- `auto_unload_timeout_seconds`: Automatically unload the model after this many seconds of inactivity (0 disables, default: 0)
Model Options:
Core Settings:
- `runner`: Which runner to use for this model
- `model`: Path to the .gguf model file
- `model_alias`: Name to use in API calls
- `inherit_env`: Override the runner's `inherit_env` setting for this model (optional)
- `env`: Dictionary of environment variables specific to this model (overrides the runner `env`)
Model Types:
- `embedding`: Set to `true` for embedding models (an example request follows this reference list)
- `reranking`: Set to `true` for reranking models
- `mmproj`: Path to a multimodal projection file (for vision models)
Performance & Memory:
- `n_ctx`: Context window size (e.g., 4096, 8192, 32768)
- `n_batch`: Batch size for processing (e.g., 256, 512)
- `n_threads`: Number of CPU threads to use
- `main_gpu`: Which GPU to use (0, 1, 2, ...)
- `n_gpu_layers`: How many layers to offload to the GPU (99 for all layers)
- `tensor_split`: Array defining how to split the model across GPUs (e.g., [1.0, 0.0])
- `offload_kqv`: Whether to offload the key-value cache to the GPU (true/false)
- `use_mlock`: Lock the model in RAM to prevent swapping (true/false)
Optimization:
- `flash_attn`: Flash attention mode - `"on"`, `"off"`, or `"auto"` (case-sensitive). Boolean values (true/false) are deprecated but still supported for backwards compatibility.
- `split_mode`: How to split model layers (`"row"` or other modes)
- `cache-type-k`: Key cache quantization type (e.g., `"q8_0"`)
- `cache-type-v`: Value cache quantization type (e.g., `"q8_0"`)
Chat & Templates:
- `chat_template`: Chat template format (e.g., `"mistral-instruct"`, `"gemma"`)
- `jinja`: Enable Jinja templating (true/false)
Advanced Options:
- `rope-scaling`: RoPE scaling method (e.g., `"linear"`)
- `rope-scale`: RoPE scaling factor (e.g., 2)
- `yarn-orig-ctx`: Original context size for YaRN scaling
- `pooling`: Pooling method for embeddings (e.g., `"cls"`)
- `args`: Additional custom arguments to pass directly to llama-server for this specific model (string, e.g., `"--custom-flag --param value"`). These are applied after all other model parameters and before the runner's `extra_args`.
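Once a model is configured with `embedding: true`, it can be queried through the OpenAI-style embeddings endpoint. A minimal sketch (assumptions: `pip install openai` and the `embedding-model` alias from the multi-GPU example above):

```python
# Request embeddings from a model exposed by FlexLLama.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

result = client.embeddings.create(
    model="embedding-model",  # must match a model_alias configured with "embedding": true
    input="FlexLLama routes this request to the embedding runner.",
)
print(len(result.data[0].embedding), "dimensions")
```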
You can validate your configuration file and run a suite of tests to ensure the application is working correctly.
To validate your `config.json` file, run `backend/config.py` and provide the path to your configuration file. This will check for correct formatting and required fields.
```bash
python backend/config.py config.json
```

A successful validation will print a confirmation message. If there are errors, they will be displayed with details on how to fix them.
The `tests/` directory contains scripts for different testing purposes. All test scripts generate detailed logs in the `tests/logs/{session_id}/` directory.
Prerequisites:

- For `test_basic.py` and `test_all_models.py`, the main application must be running (`flexllama config.json`).
- For `test_model_switching.py`, the main application should not be running.
`test_basic.py` performs basic checks on the API endpoints to ensure they are responsive.
```bash
# Run basic tests against the default URL (http://localhost:8080)
python tests/test_basic.py
```

What it tests:

- `/v1/models` and `/health` endpoints
- `/v1/chat/completions` with both regular and streaming responses
- Concurrent request handling
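If you want to exercise streaming by hand outside the test script, here is a minimal sketch with the `openai` client (assumptions: `pip install openai` and a model exposed under the alias `my-model`):

```python
# Stream a chat completion token by token from FlexLLama's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="my-model",  # must match a model_alias from your config.json
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```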
`test_all_models.py` runs a comprehensive test suite against every model defined in your `config.json`.
```bash
# Test all configured models
python tests/test_all_models.py config.json
```

What it tests:
- Model loading and health checks
- Chat completions (regular and streaming) for each model
- Response time and error handling
`test_model_switching.py` verifies the dynamic loading and unloading of models.
```bash
# Run model switching tests
python tests/test_model_switching.py config.json
```

What it tests:
- Dynamic model loading and switching
- Runner state management and health monitoring
- Proper cleanup of resources
This project is licensed under the BSD-3-Clause License. See the LICENSE file for details.
Ready to run multiple LLMs like a pro? Edit your `config.json` and start FlexLLama!

