Run Code Llama with a 32k-token context using Flash Attention and BetterTransformer
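As a rough sketch of what enabling Flash Attention looks like with Hugging Face `transformers` (the model id and argument values below are illustrative assumptions, not necessarily what the notebook uses):

```python
# Illustrative only: keyword arguments one would pass to `from_pretrained`
# to enable Flash Attention 2. The model id is an assumption.
MODEL_ID = "codellama/CodeLlama-7b-Instruct-hf"

load_kwargs = {
    "device_map": "auto",                         # place layers on the GPU
    "attn_implementation": "flash_attention_2",   # memory-efficient attention
}

# On a machine with a GPU and `transformers` + `flash-attn` installed:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **load_kwargs)
```

The `from_pretrained` call itself is left commented out because it requires a GPU and a multi-gigabyte model download.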
Option 1 - Google Colab:
- Download the .ipynb notebook
- Select a GPU runtime
- An A100 with 40 GB allows for roughly a 25k-token context length
Option 2 - Run on a server (e.g. AWS or RunPod (affiliate link))
- Spin up an A100 80 GB server
- Run the notebook and select a 50,000-token context length
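The 25k vs. 50k context limits on 40 GB vs. 80 GB GPUs can be sanity-checked with back-of-envelope KV-cache arithmetic. The sketch below assumes the 7B Code Llama model in 16-bit precision with 32 layers and a 4096 hidden size; real usage also includes activations and framework overhead, so treat these as rough lower bounds:

```python
# Back-of-envelope memory estimate: model weights + KV cache.
# Assumptions: 7B parameters, 2 bytes per value (16-bit), 32 layers,
# 4096 hidden size. Activations and overhead are ignored.
BYTES_PER_VALUE = 2
N_LAYERS = 32
HIDDEN = 4096
WEIGHTS_GB = 7e9 * BYTES_PER_VALUE / 1e9   # ~14 GB of model weights

def kv_cache_gb(context_tokens: int) -> float:
    # One K and one V vector per layer, per token
    return 2 * N_LAYERS * HIDDEN * BYTES_PER_VALUE * context_tokens / 1e9

for tokens, gpu_gb in [(25_000, 40), (50_000, 80)]:
    total = WEIGHTS_GB + kv_cache_gb(tokens)
    print(f"{tokens:>6} tokens: ~{total:.0f} GB needed, {gpu_gb} GB GPU")
```

Under these assumptions, 25k tokens needs roughly 27 GB (fits in 40 GB with headroom for activations) and 50k tokens roughly 40 GB (comfortable on an 80 GB card), consistent with the limits above.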
- Allows saving and reloading conversations
- Allows uploading and analyzing documents
- Works on Google Colab or on a server (e.g. AWS, Azure, RunPod)
- Purchase here