A demo application showcasing how to run a local LLM on your own hardware. Includes samples that leverage open-source libraries (llama.cpp) and models (llama), as well as documentation from Nexus Dashboard.
First clone the project and navigate into project directory
git clone https://github.com/ndavidson19/ciscolive.git
cd ciscolive/ciscolive-demo/documentation-llm
Next you must download the modelfile. Huggingface has so many models to choose from and all have very elaborate names. We will be choosing a DPO finetuned version of StableLM. This small 3B model punches above its weight when it comes to RAG applications.
Next create a directory called llm in the backend folder
cd /cisco-live/documentation-llm/backend
mkdir llm
Then move the modelfile to the correct directory ciscolive/ciscolive-demo/documentation-llm/backend/llm/dolphin-2.6-mistral-7b-dpo-laser.Q4_K_M.gguf
cd /cisco-live/documentation-llm/backend
mkdir llm
This entire application has been dockerized and can be run with just
docker-compose up --build
This starts three different services.
- The Vector Datastore (pgvector)
- This pulls a postgres image from ankane/pgvector that installs the correct extensions for allow vectors within postgres.
- The Flask serving APIs and VectorDB insertion
- This service starts a flask API endpoint route (/get_message) on port :5000 that allows for a user to send queries to the LLM being served using LlamaCPP (https://github.com/abetlen/llama-cpp-python) using script at /backend/main.py
- This service also parses the pdf living in /training/pdfs/ using /training/pdf.py and then inserts it into the database using /training/db-embeddings.py
- The UI service
- Uses nginx to start a basic webserver for the basic index.html file
Note: This is a very simplistic scaled down version of our full architecture we are running in production and should be treated as a starting point. Look into the llama-cpp-python OpenAI compatible webserver if you are going to be creating your own application.
It is recommended to create a virtual-env before installing dependencies. Or use a dependency manager such as anaconda. Ex.
python3 -m venv venv_name
source venv_name/bin/activate
pip install -r requirements.txt
Next you must download the modelfile. Rocket 3B
Next move the modelfile to the correct directory /cisco-live/documentation-llm/backend/llm/llama-2-7b-chat.Q4_K_M.gguf
cd /cisco-live/documentation-llm/backend
mkdir llm
-
Database Setup:
- Run the PostgreSQL vector extension for embeddings:
docker pull ankane/pgvector docker run -p 5432:5432 -e POSTGRES_PASSWORD=secret -e POSTGRES_USER=postgres ankane/pgvector
- Run the PostgreSQL vector extension for embeddings:
-
Training Pipeline:
- Navigate to the
training
directory. - Run
pdf.py
to parse PDFs anddb-embeddings.py
to store embeddings:python pdf.py python db-embeddings.py
- Navigate to the
-
Start the Backend:
- Use llama-cpp-python OpenAI compatible webserver for managing model serving.
python3 -m llama_cpp.server --config_file /<USER_PATH>/documentation-llm/backend/llm/config.json
- Start the backend services located in
backend/inference
:python main.py
- Use llama-cpp-python OpenAI compatible webserver for managing model serving.
Run the below command in the root directory of the project.
python -m http.server
Navigate to http://localhost:8000/ in your browser. To load the UI you just need to open the index.html file that lives in the cisco-live/documentation-llm/ui directory.
You should be all set to start asking questions!
A license is required for others to be able to use your code. An open source license is more than just a usage license, it is license to contribute and collaborate on code. Open sourcing code and contributing it to Code Exchange requires a commitment to maintain the code and help the community use and contribute to the code. More about open-source licenses