FlexLLama is a lightweight, extensible, and user-friendly self-hosted tool that easily runs multiple llama.cpp server instances with OpenAI v1 API compatibility. It's designed to manage multiple models across different GPUs, making it a powerful solution for local AI development and deployment.
- **Multiple llama.cpp instances** - Run different models simultaneously
- **Multi-GPU support** - Distribute models across different GPUs
- **OpenAI v1 API compatible** - Drop-in replacement for OpenAI endpoints
- **Real-time dashboard** - Monitor model status with a web interface
- **Chat & Completions** - Full chat and text completion support
- **Embeddings & Reranking** - Supports models for embeddings and reranking
- **Auto-start** - Automatically start default runners on launch
- **Model switching** - Dynamically load/unload models as needed
Want to get started in 5 minutes? Check out our QUICKSTART.md for a simple Docker setup with the Qwen3-4B model!
1. **Install FlexLLama:**

   From GitHub:

   ```bash
   pip install git+https://github.com/yazon/flexllama.git
   ```

   From local source (after cloning):

   ```bash
   # git clone https://github.com/yazon/flexllama.git
   # cd flexllama
   pip install .
   ```
2. **Create your configuration:** Copy the example configuration file to create your own. If you installed from a local clone, you can run:

   ```bash
   cp config_example.json config.json
   ```

   If you installed from Git, you may need to download `config_example.json` from the repository.
3. **Edit `config.json`:** Update `config.json` with the correct paths for your `llama-server` binary and your model files (`.gguf`).
4. **Run FlexLLama:**

   ```bash
   python main.py config.json
   ```

   or

   ```bash
   flexllama config.json
   ```
5. **Open the dashboard:**

   http://localhost:8080
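Once FlexLLama is running, any OpenAI v1 client can talk to it. Below is a minimal sketch using the official `openai` Python package (assumptions: `pip install openai`, and a model exposed under the alias `my-model` as in the example configuration later in this README):

```python
# Point the standard OpenAI client at the local FlexLLama endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # no API key required locally

response = client.chat.completions.create(
    model="my-model",  # must match a model_alias from your config.json
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```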
FlexLLama can be run using Docker and Docker Compose. We provide profiles for both CPU-only and GPU-accelerated (NVIDIA CUDA) environments.
Clone the repository:

```bash
git clone https://github.com/yazon/flexllama.git
cd flexllama
```
After cloning, you can proceed with the quick start script or a manual setup.
For an easier start, the `docker-start.sh` helper script automates several setup steps. It checks your Docker environment, builds the correct image (CPU or GPU), and provides the commands to launch FlexLLama.
1. **Make the script executable (Linux/Unix):**

   ```bash
   chmod +x docker-start.sh
   ```
2. **Run the script:** Use the `--gpu` flag for NVIDIA GPU support.

   For a CPU-only setup:

   ```bash
   ./docker-start.sh
   ```

   or, on Windows:

   ```powershell
   ./docker-start.ps1
   ```

   For a GPU-accelerated setup:

   ```bash
   ./docker-start.sh --gpu
   ```

   or, on Windows:

   ```powershell
   ./docker-start.ps1 -gpu
   ```
3. **Follow the on-screen instructions:** The script will guide you.
Manual Docker and Docker Compose Setup
If you prefer to run the steps manually, follow this guide:
1. **Place your models:**

   ```bash
   # Create the models directory if it doesn't exist
   mkdir -p models

   # Copy your .gguf model files into it
   cp /path/to/your/model.gguf models/
   ```
2. **Configure your models:** Edit the Docker configuration to point to your models:

   - CPU-only: keep `"n_gpu_layers": 0`
   - GPU: set `"n_gpu_layers"` to e.g. 99 and specify `"main_gpu": 0`
3. **Build and start FlexLLama with Docker Compose (recommended):** Use the `--profile` flag to select your environment. The service will be available at http://localhost:8080.

   For CPU-only:

   ```bash
   docker compose --profile cpu up --build -d
   ```

   For GPU support (NVIDIA CUDA):

   ```bash
   docker compose --profile gpu up --build -d
   ```
4. **View logs:** To monitor the output of your services, you can view their logs in real time.

   For the CPU service:

   ```bash
   docker compose --profile cpu logs -f
   ```

   For the GPU service:

   ```bash
   docker compose --profile gpu logs -f
   ```

   (Press `Ctrl+C` to stop viewing the logs.)
5. **(Alternative) Using `docker run`:** You can also build and run the containers manually.

   For CPU-only:

   ```bash
   # Build the image
   docker build -t flexllama:latest .

   # Run the container
   docker run -d -p 8080:8080 \
     -v $(pwd)/models:/app/models:ro \
     -v $(pwd)/docker/config.json:/app/config.json:ro \
     flexllama:latest
   ```

   For GPU support (NVIDIA CUDA):

   ```bash
   # Build the image
   docker build -f Dockerfile.cuda -t flexllama-gpu:latest .

   # Run the container
   docker run -d --gpus all -p 8080:8080 \
     -v $(pwd)/models:/app/models:ro \
     -v $(pwd)/docker/config.json:/app/config.json:ro \
     flexllama-gpu:latest
   ```
6. **Open the dashboard:** Access the FlexLLama dashboard in your browser:

   http://localhost:8080
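To check the containerized service from a script instead of the browser, here is a small sketch using the `requests` package (an assumption: `pip install requests`; the `/health` path matches the `health_endpoint` shown in the example configuration below):

```python
# Quick sanity check of a running FlexLLama instance.
import requests

base = "http://localhost:8080"
for path in ("/health", "/v1/models"):
    r = requests.get(base + path, timeout=10)
    print(path, r.status_code, r.text[:200])  # status code plus a short response preview
```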
Edit `config.json` to configure your runners and models:

```json
{
    "auto_start_runners": true,
    "api": {
        "host": "0.0.0.0",
        "port": 8080,
        "health_endpoint": "/health"
    },
    "runner1": {
        "type": "llama-server",
        "path": "/path/to/llama-server",
        "host": "127.0.0.1",
        "port": 8085,
        "inherit_env": true,
        "env": {}
    },
    "models": [
        {
            "runner": "runner1",
            "model": "/path/to/model.gguf",
            "model_alias": "my-model",
            "n_ctx": 4096,
            "n_gpu_layers": 99,
            "main_gpu": 0
        }
    ]
}
```

A multi-GPU setup with separate chat and embedding runners looks like this:

```json
{
    "runner_gpu0": {
        "path": "/path/to/llama-server",
        "port": 8085,
        "inherit_env": true,
        "env": {}
    },
    "runner_gpu1": {
        "path": "/path/to/llama-server",
        "port": 8086,
        "inherit_env": true,
        "env": {}
    },
    "models": [
        {
            "runner": "runner_gpu0",
            "model": "/path/to/chat-model.gguf",
            "model_alias": "chat-model",
            "main_gpu": 0,
            "n_gpu_layers": 99
        },
        {
            "runner": "runner_gpu1",
            "model": "/path/to/embedding-model.gguf",
            "model_alias": "embedding-model",
            "embedding": true,
            "main_gpu": 1,
            "n_gpu_layers": 99
        }
    ]
}
```

FlexLLama supports automatic model unloading to free up RAM when models are idle. This is useful for managing memory usage when running multiple models:
```json
{
    "runner_memory_saver": {
        "path": "/path/to/llama-server",
        "port": 8085,
        "auto_unload_timeout_seconds": 300
    },
    "runner_always_on": {
        "path": "/path/to/llama-server",
        "port": 8086,
        "auto_unload_timeout_seconds": 0
    },
    "models": [
        {
            "runner": "runner_memory_saver",
            "model": "/path/to/large-model.gguf",
            "model_alias": "large-model"
        },
        {
            "runner": "runner_always_on",
            "model": "/path/to/small-model.gguf",
            "model_alias": "small-model"
        }
    ]
}
```

Auto-unload Behavior:

- `auto_unload_timeout_seconds: 0` - Disables auto-unload (default)
- `auto_unload_timeout_seconds: 300` - Unloads the model after 5 minutes of inactivity
- Models are considered "active" while processing requests (including streaming)
- The timeout is measured from the last request completion
- Auto-unload frees RAM by stopping the runner process entirely
- Models will be automatically reloaded when the next request arrives
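One rough way to observe this end to end is to time two identical requests around the idle window. This is only a sketch (assumptions: `pip install requests`, plus the `large-model` alias and the 300-second timeout from the example above):

```python
# Time a warm request, wait past the idle timeout, then time a cold request
# that triggers an automatic model reload.
import time
import requests

URL = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "large-model",  # alias from the auto-unload example above
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}

def timed_request() -> float:
    start = time.time()
    requests.post(URL, json=payload, timeout=600).raise_for_status()
    return time.time() - start

print(f"warm request: {timed_request():.1f}s")
time.sleep(310)  # exceed auto_unload_timeout_seconds (300), so the runner is stopped
print(f"cold request: {timed_request():.1f}s (includes the automatic reload)")
```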
FlexLLama supports setting environment variables for runners and individual models. This is useful for configuring GPU devices, library paths, or other runtime settings.
```json
{
    "runner_vulkan": {
        "type": "llama-server",
        "path": "/path/to/llama-server",
        "port": 8085,
        "inherit_env": true,
        "env": {
            "GGML_VULKAN_DEVICE": "1",
            "RUNNER_SPECIFIC_VAR": "value"
        }
    },
    "models": [
        {
            "runner": "runner_vulkan",
            "model": "/path/to/model.gguf",
            "model_alias": "my-model",
            "env": {
                "MODEL_SPECIFIC_VAR": "override"
            }
        }
    ]
}
```

Runner Options:

- `path`: Path to the llama-server binary
- `host` / `port`: Where to run this instance
- `inherit_env`: Whether to inherit parent environment variables (default: `true`)
- `env`: Dictionary of environment variables to set for all models on this runner
- `extra_args`: Additional arguments for llama-server (applied to all models using this runner)
- `auto_unload_timeout_seconds`: Automatically unload the model after this many seconds of inactivity (0 disables, default: 0)
Model Options:
Core Settings:
- `runner`: Which runner to use for this model
- `model`: Path to the .gguf model file
- `model_alias`: Name to use in API calls
- `inherit_env`: Override the runner's `inherit_env` setting for this model (optional)
- `env`: Dictionary of environment variables specific to this model (overrides the runner `env`)
Model Types:
- `embedding`: Set to `true` for embedding models (an example request follows this reference list)
- `reranking`: Set to `true` for reranking models
- `mmproj`: Path to a multimodal projection file (for vision models)
Performance & Memory:
- `n_ctx`: Context window size (e.g., 4096, 8192, 32768)
- `n_batch`: Batch size for processing (e.g., 256, 512)
- `n_threads`: Number of CPU threads to use
- `main_gpu`: Which GPU to use (0, 1, 2, ...)
- `n_gpu_layers`: How many layers to offload to the GPU (99 for all layers)
- `tensor_split`: Array defining how to split the model across GPUs (e.g., [1.0, 0.0])
- `offload_kqv`: Whether to offload the key-value cache to the GPU (true/false)
- `use_mlock`: Lock the model in RAM to prevent swapping (true/false)
Optimization:
- `flash_attn`: Flash attention mode - `"on"`, `"off"`, or `"auto"` (case-sensitive). Boolean values (true/false) are deprecated but still supported for backwards compatibility.
- `split_mode`: How to split model layers (`"row"` or other modes)
- `cache-type-k`: Key cache quantization type (e.g., `"q8_0"`)
- `cache-type-v`: Value cache quantization type (e.g., `"q8_0"`)
Chat & Templates:
- `chat_template`: Chat template format (e.g., `"mistral-instruct"`, `"gemma"`)
- `jinja`: Enable Jinja templating (true/false)
Advanced Options:
- `rope-scaling`: RoPE scaling method (e.g., `"linear"`)
- `rope-scale`: RoPE scaling factor (e.g., 2)
- `yarn-orig-ctx`: Original context size for YaRN scaling
- `pooling`: Pooling method for embeddings (e.g., `"cls"`)
- `args`: Additional custom arguments to pass directly to llama-server for this specific model (string, e.g., `"--custom-flag --param value"`). These are applied after all other model parameters and before the runner's `extra_args`.
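Once a model is configured with `embedding: true`, it can be queried through the OpenAI-style embeddings endpoint. A minimal sketch (assumptions: `pip install openai` and the `embedding-model` alias from the multi-GPU example above):

```python
# Request embeddings from a model exposed by FlexLLama.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

result = client.embeddings.create(
    model="embedding-model",  # must match a model_alias configured with "embedding": true
    input="FlexLLama routes this request to the embedding runner.",
)
print(len(result.data[0].embedding), "dimensions")
```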
You can validate your configuration file and run a suite of tests to ensure the application is working correctly.
To validate your `config.json` file, run `backend/config.py` and provide the path to your configuration file. This will check for correct formatting and required fields.
```bash
python backend/config.py config.json
```

A successful validation will print a confirmation message. If there are errors, they will be displayed with details on how to fix them.
The `tests/` directory contains scripts for different testing purposes. All test scripts generate detailed logs in the `tests/logs/{session_id}/` directory.
Prerequisites:

- For `test_basic.py` and `test_all_models.py`, the main application must be running (`flexllama config.json`).
- For `test_model_switching.py`, the main application should not be running.
`test_basic.py` performs basic checks on the API endpoints to ensure they are responsive.
```bash
# Run basic tests against the default URL (http://localhost:8080)
python tests/test_basic.py
```

What it tests:

- `/v1/models` and `/health` endpoints
- `/v1/chat/completions` with both regular and streaming responses
- Concurrent request handling
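If you want to exercise streaming by hand outside the test script, here is a minimal sketch with the `openai` client (assumptions: `pip install openai` and a model exposed under the alias `my-model`):

```python
# Stream a chat completion token by token from FlexLLama's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="my-model",  # must match a model_alias from your config.json
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```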
`test_all_models.py` runs a comprehensive test suite against every model defined in your `config.json`.
```bash
# Test all configured models
python tests/test_all_models.py config.json
```

What it tests:
- Model loading and health checks
- Chat completions (regular and streaming) for each model
- Response time and error handling
`test_model_switching.py` verifies the dynamic loading and unloading of models.
```bash
# Run model switching tests
python tests/test_model_switching.py config.json
```

What it tests:
- Dynamic model loading and switching
- Runner state management and health monitoring
- Proper cleanup of resources
This project is licensed under the BSD-3-Clause License. See the LICENSE file for details.
Ready to run multiple LLMs like a pro? Edit your `config.json` and start FlexLLama!

