## Description

Adds initial support for a `Llama-cpp` LLM for running local models. This enables the model to be used, but streaming and some other things don't work exactly right yet.

## Setup
- Download models: tested with GGUF-based models from huggingface.
- Save the model to `<ix root>/llama/<model>`.
- Create the LLM + chain (see the sketch after this list).
- Set `model_path` to `/var/app/ix/llama/<model>`.
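For reference, a minimal sketch of the LLM + chain wiring outside of IX, assuming the component delegates to langchain's `LlamaCpp` wrapper; the GGUF filename below is a placeholder for whatever model was saved under `/var/app/ix/llama/`:

```python
# Hedged sketch, not part of this PR: create a llama-cpp backed LLM and a simple chain.
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate

llm = LlamaCpp(
    model_path="/var/app/ix/llama/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder filename
    n_ctx=2048,
    temperature=0.7,
)

chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate.from_template("Answer briefly: {question}"),
)

print(chain.run(question="What is llama.cpp?"))
```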
## Changes

- Adds `LLAMA_CPP_LLM`.

## How Tested

## TODOs
- Streaming isn't working with `LLAMA_CPP_LLM`: `IxHandler` isn't receiving all of the kwargs the model is initialized with, so it can't tell whether streaming was enabled. This is a potential blocker. LLAMA_CPP appears to be intentionally filtering these out of its invocation params (see the first sketch after this list).
- `IxHandler` needs to be updated to start streaming for LLMs via `chain.astream()` (streaming is only supported for chat models right now) if a workaround can't be found.
- The Docker image isn't set up for GPU acceleration. I made a short attempt at adding libraries to compile GPU support with `ENV LLAMA_CUBLAS=1`, but the required libraries weren't installed in the `python:3.11` docker image. It wasn't readily apparent how to install the library; this may require switching to a different base image with better support (see the second sketch below).
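For the streaming TODO, a rough diagnostic sketch, assuming the component wraps langchain's `LlamaCpp` and that the chain exposes the async streaming interface; both are assumptions and the model path is a placeholder:

```python
# Hedged sketch for the streaming TODOs above; not part of this PR.
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/var/app/ix/llama/model.gguf",  # placeholder path
    streaming=True,
)

# 1) Inspect what the wrapper reports about itself. If `streaming` is filtered
#    out of these params, a callback handler has no way to tell it was enabled.
print(llm._identifying_params)

# 2) Possible fallback: stream at the chain level instead of relying on
#    callbacks, if the chain supports astream().
async def stream_response(chain, question: str) -> None:
    async for chunk in chain.astream({"question": question}):
        print(chunk, end="", flush=True)
```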
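For the GPU TODO, once a cuBLAS-enabled build of `llama-cpp-python` is available in the image, offload would presumably be controlled with `n_gpu_layers`; a sketch under that assumption, not verified in this image:

```python
# Hedged sketch: GPU offload with a cuBLAS-enabled llama-cpp-python build.
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/var/app/ix/llama/model.gguf",  # placeholder path
    n_gpu_layers=35,  # layers to offload to the GPU; 0 keeps everything on CPU
)
```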