Open-source LLMs have impressive capabilities, but they are both large and expensive to train and operate.
To tackle this, I built a proof of concept (POC) around the Llama-2-7B model that uses more efficient approaches to finetuning and deployment:
- Use QLoRA to perform parameter-efficient finetuning (PEFT) on the Llama-2-7B model. This significantly reduces memory usage, making it possible to train the LLM on a single GPU at far lower cost. I demonstrate instruction-tuning on a custom dataset using a free Colab instance.
- Run the quantized LLM on CPU only, using open-source libraries such as llama.cpp for inference. I also deployed a containerized GGML-quantized model to AWS Lambda using AWS CDK and FastAPI.
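To make the memory saving concrete, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone of a 7-billion-parameter model at different precisions. It is only a sketch: it ignores activations, optimizer state, and the small overhead that quantization adds for per-block scales, and the parameter count is the nominal "7B" figure rather than an exact one.

```python
PARAMS = 7_000_000_000  # nominal parameter count for a "7B" model (assumption)


def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
        print(f"{label:>5}: {weight_memory_gib(PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights need roughly 13 GiB in fp16 but only about 3.3 GiB at 4 bits, which is what leaves headroom for training on a single 16 GB T4.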
A pre-trained Llama-2-7B chat model was finetuned with the QLoRA technique on the datasets listed in the table below. The entire finetuning process ran on Google Colab's free tier using an Nvidia T4 GPU.
Hugging Face Model | Base Model | Dataset | Colab | Remarks |
---|---|---|---|---|
llama2-7b-v2 | meta-llama/Llama-2-7b-chat-hf | Amod/mental_health_counseling_conversations | Latest | |
llama2-7b-v1 | meta-llama/Llama-2-7b-chat-hf | heliosbrahma/mental_health_conversational_dataset | Experimental | |
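For readers who want to see roughly what a QLoRA setup looks like, the sketch below loads the base model in 4-bit precision and attaches LoRA adapters using the Hugging Face `transformers` and `peft` libraries (with `bitsandbytes` installed). The hyperparameters in `QLORA_SETTINGS` are hypothetical values chosen for illustration; they are not the ones used in the Colab notebooks, and the notebooks may structure this differently.

```python
# Hypothetical hyperparameters for illustration only; the values actually
# used for the models in the table above are not stated here.
QLORA_SETTINGS = {
    "base_model": "meta-llama/Llama-2-7b-chat-hf",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],
}


def load_qlora_model(settings=QLORA_SETTINGS):
    """Load the base model in 4-bit and wrap it with trainable LoRA adapters."""
    # Imported lazily so the settings can be inspected without a GPU stack.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,  # the T4 lacks native bfloat16
    )
    model = AutoModelForCausalLM.from_pretrained(
        settings["base_model"],
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Prepares the frozen quantized model for adapter training.
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=settings["lora_r"],
        lora_alpha=settings["lora_alpha"],
        lora_dropout=settings["lora_dropout"],
        target_modules=settings["target_modules"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    # Only the adapter weights are trainable; the 4-bit base stays frozen.
    return get_peft_model(model, lora_config)
```

Calling `load_qlora_model()` requires a CUDA GPU and access to the gated Llama 2 weights on Hugging Face. The returned model can then be passed to a trainer of your choice.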
To quantize the model, follow these steps:
Clone the llama.cpp repository and install the required dependencies by running these commands:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
python3 -m pip install -r requirements.txt
Download the original LLaMA model weights and place them in the ./models directory within the project structure.
Execute the make_ggml.py script with the necessary arguments to perform the quantization. Use the following command:
python make_ggml.py /path/to/model model-name /path/to/output_directory
Replace the placeholders with actual paths and names:
- /path/to/model: Path to your model.
- model-name: Name of the model.
- /path/to/output_directory: Path to the output directory where the quantized model will be saved.
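For intuition about what quantization does to the weights, the following is a simplified sketch of block-wise 4-bit quantization in plain Python. It is not the GGML file format and not what make_ggml.py or llama.cpp actually executes; the block size of 32 and the symmetric absmax scaling are assumptions chosen for illustration.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block size, for illustration only

Block = Tuple[float, List[int]]  # (scale, 4-bit integer codes)


def quantize_block(values: List[float]) -> Block:
    """Map one block of floats to integers in [-7, 7] plus a shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-7, min(7, round(v / scale))) for v in values]
    return scale, quants


def quantize(weights: List[float]) -> List[Block]:
    """Split the weights into blocks and quantize each one independently."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks: List[Block]) -> List[float]:
    """Reconstruct approximate float weights from the quantized blocks."""
    restored: List[float] = []
    for scale, quants in blocks:
        restored.extend(q * scale for q in quants)
    return restored


if __name__ == "__main__":
    import random

    random.seed(0)
    weights = [random.gauss(0, 0.02) for _ in range(128)]
    restored = dequantize(quantize(weights))
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max round-trip error: {worst:.5f}")
```

Each block stores one float scale plus small integer codes, so the per-weight cost drops to a little over 4 bits, at the price of a rounding error of at most half a scale step per weight.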
Next, deploy a container that can run the llama.cpp-converted model on AWS Lambda. We take reference from the OpenLlama on AWS Lambda project, which contains the AWS CDK code to create and deploy a Lambda function serving our model of choice, with a FastAPI frontend accessible from a Lambda Function URL. Note that the AWS Lambda free tier includes 400,000 GB-seconds of compute each month. With proper tuning, this gives us scalable LLM inference at minimal cost.
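As a rough guide to what the free compute allowance covers, the sketch below converts GB-seconds into a number of requests. The memory size and per-request duration are hypothetical inputs, not measurements from this project, and the estimate ignores request charges, storage, and data transfer.

```python
FREE_TIER_GB_SECONDS = 400_000  # monthly free compute allowance quoted above


def free_invocations(memory_mb: int, seconds_per_request: float) -> int:
    """Estimate how many requests fit in the monthly free compute allowance."""
    # Lambda bills compute as configured memory (in GB) times duration.
    gb_seconds_per_request = (memory_mb / 1024) * seconds_per_request
    return int(FREE_TIER_GB_SECONDS // gb_seconds_per_request)


if __name__ == "__main__":
    # Hypothetical configuration: 8 GB of memory and 30 s per response.
    print(free_invocations(memory_mb=8192, seconds_per_request=30))
```

With those example numbers each request costs 240 GB-seconds, so the allowance covers about 1,666 requests per month. Reducing memory or response time raises that figure proportionally.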
We will need the following requirements to get started:
- Docker installed on our system and running.
- AWS CDK installed on our system, as well as an AWS account, proper credentials, etc.
- Python 3.9+
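A small preflight script can confirm most of these requirements before running the installer. This is a sketch, not part of the project: it assumes the Docker and CDK command-line tools are named `docker` and `cdk` on your PATH, and it does not verify AWS credentials.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 9)


def check_prerequisites() -> dict:
    """Return a mapping of requirement name to whether it appears satisfied."""
    results = {
        "python": sys.version_info >= MIN_PYTHON,
        "docker_installed": shutil.which("docker") is not None,
        "cdk_installed": shutil.which("cdk") is not None,
    }

    docker_running = False
    if results["docker_installed"]:
        try:
            # `docker info` exits non-zero when the daemon is unreachable.
            proc = subprocess.run(
                ["docker", "info"], capture_output=True, timeout=30
            )
            docker_running = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            docker_running = False
    results["docker_running"] = docker_running
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:<18} {'ok' if ok else 'MISSING'}")
```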
Once the requirements are installed, run the following on Mac/Linux:
cd ./llama_lambda
chmod +x install.sh
./install.sh
Follow the prompts. When the deployment completes, the CDK will provide you with a Lambda Function URL to test the function.
You should see an interface like this:
Note that the model will take longer to respond at first. When a request hits a Lambda function that has just been deployed, AWS needs to allocate the necessary resources for it: setting up a new container, loading the runtime, and then loading the function code. Because our function also loads a large model, the cold start takes even longer.
However, after a function has been invoked, AWS keeps the execution context (i.e., the container) "warm" for some time in anticipation of another request. If another request comes in while the context is still warm, AWS reuses it, avoiding the initialization overhead and thereby providing faster response times. Hence, for subsequent calls, the model should provide a response significantly faster.
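The warm-start behaviour described above is usually exploited by keeping the model in a module-level variable, so it is loaded once per container rather than once per request. The sketch below shows the pattern with a stand-in loader. `load_model` is hypothetical and merely sleeps to simulate an expensive load; whether the deployed function caches its model in exactly this way is an assumption.

```python
import time

_MODEL = None  # module scope, so it survives across warm invocations


def load_model():
    """Stand-in for an expensive model load (hypothetical; just sleeps)."""
    time.sleep(0.2)
    return object()


def get_model():
    """Load the model on the first call and reuse it afterwards."""
    global _MODEL
    if _MODEL is None:  # only true on a cold start
        _MODEL = load_model()
    return _MODEL


def handler(event, context=None):
    """Lambda-style entry point that reports how long model access took."""
    start = time.perf_counter()
    get_model()
    return {"elapsed_s": time.perf_counter() - start}


if __name__ == "__main__":
    print("cold:", handler({}))
    print("warm:", handler({}))
```

The first call pays the loading cost, while later calls in the same process return almost immediately, mirroring the difference between cold and warm Lambda invocations.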