Ready to explore the world of local AI? This guide will help you set up and run language models right on your computer, giving you full control over your AI tools.
This tutorial is for anyone seeking greater control and transparency over how their data is processed, regardless of background or expertise. It walks you through setting up a local LLM environment, step by step, using Ollama as the backend and the Page Assist extension in your browser as the frontend.
The use of Large Language Model (LLM) services online presents significant privacy concerns:
- Data Storage and Processing by Third Parties: Your data is stored and processed by third-party providers, which can result in unintended consequences such as:
  - Sharing your input data with other users.
  - Using your data for purposes beyond your initial intent.
- Algorithm Complexity: The complexity of the algorithms used to train these models poses challenges:
  - Opacity: Algorithms are often opaque, making it difficult to understand how your data is being processed.
  - Bias and Discrimination: This lack of transparency can lead to biased or discriminatory outcomes.
- Scale and Data Breaches: The large scale of cloud-based LLMs means that even minor issues can result in:
  - Massive data breaches.
  - Compromising the privacy and security of countless users.
Running Large Language Models (LLMs) locally offers several compelling benefits:
- Complete Control Over Input Data: Maintain complete control over your input data and ensure its confidentiality – no sensitive information is shared with third-party providers.
- Faster Processing Times and Reduced Latency: Enjoy faster processing times and reduced latency, making local models well suited for applications where real-time feedback is crucial.
- Avoidance of Data Sovereignty Issues: Keeping your data and model on-premises helps you avoid potential issues related to:
  - Compliance with data localization requirements.
  - Regulatory standards.
Ultimately, running LLM models locally provides a high degree of privacy and control, allowing you to tailor the model's training and deployment to meet your specific goals and objectives.
To get started, select a local model provider that streamlines hosting and deploying an LLM.
This section gives you the knowledge you need to choose the right provider for your specific needs.
Provider | User Interface | Ease of use | Model customization | Built-in chat interface | Model discovery/download | Multi-platform support | Open-source | Integration with other tools |
---|---|---|---|---|---|---|---|---|
Ollama | Command-line & API | Simple | Yes | No (requires client) | Limited | Windows, macOS, Linux | Yes | Extensive |
LM Studio | GUI | User-friendly | Yes | Yes | Extensive | Windows, macOS, Linux | No | Limited |
Jan | GUI | User-friendly | Limited | Yes | Limited | Windows, macOS, Linux | Yes | Limited |
MSTY | GUI | User-friendly | Limited | Yes | Limited | macOS | No | Limited |
Enchanted | GUI | User-friendly | Limited | Yes | Limited | macOS | No | Yes (macOS) |
AnythingLLM | Web-based | User-friendly | Yes | Yes | Limited | Cross-platform (web-based) | Yes | Yes |
Once you have selected your local model provider, you'll need to decide which model to run. For users with limited system resources or older hardware, we also list cloud-based providers that can run these models for you.
This section gives you the information you need to choose the right model for your specific needs. To help with this, we have prepared two reference tables.
- The first table showcases open-source models, which can be run locally on your machine. To ensure optimal performance, we have outlined the recommended hardware requirements for each model.
- The second table features proprietary models, which typically operate on cloud-based providers.
Important
The VRAM requirements listed in the tables are indicative estimates, calculated for Q4_0 quantization (Ollama's default), which balances model precision and inference speed.
While it may be possible to run certain models on lower hardware specifications, inference will be slower: if the model does not fit entirely in VRAM, part of it is offloaded to system memory, which is significantly slower and can greatly reduce inference speed. See the rough estimate below.
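As a rough rule of thumb (our own approximation derived from the Q4_0 figures in the tables below, not an exact formula), you can budget about half a gigabyte of VRAM per billion parameters, plus a couple of gigabytes of overhead for the context window and runtime:

```
# Rough Q4_0 VRAM estimate: ~0.5 GB per billion parameters + ~2 GB overhead.
# Example: for an 8B model this prints "roughly 6 GB", matching the tables below.
PARAMS_B=8
echo "A ${PARAMS_B}B model at Q4_0 needs roughly $((PARAMS_B / 2 + 2)) GB of VRAM"
```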
Note
The models are ranked by their Quality Index (higher scores indicate better performance) from the Artificial Analysis LLM Leaderboard. Note that the Quality Index changes with the daily test runs, and these tables will be updated regularly to reflect the latest rankings.
We consider this benchmarking methodology less biased than the Elo score method employed by LMSys. Furthermore, the LMSys leaderboard does not address dataset contamination, model quantization, or model overfitting.
Massive models: Local deployment can be challenging due to high computational requirements. These models are commonly used on cloud-based provider platforms.
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Qwen2.5-72B-Instruct | 72.2B | 47GB+ VRAM GPU (2xRTX 4090 or better) | 75.2 | |||
Mistral-Large-2-Instruct | 123B | 70GB+ VRAM GPU (3xRTX 4090 or better) | 73 | |||
Llama-3.1-405b-Instruct | 405B | 230GB+ VRAM GPU (4xH100 or better) | 71.9 | |||
Llama-3.1-Nemotron-70B-Instruct-HF | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 69.9 | |||
Qwen2-72B-Instruct | 72B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 68.9 | |||
Deepseek-v2.5 | 236B | 133GB+ VRAM GPU (2xH100 or better) | 65.8 | |||
Llama-3.2-90B-Vision-Instruct | 88.6B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 65.5 | |||
Llama-3.1-70b-Instruct | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 65.3 | |||
Llama-3-70b-Instruct | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 61.9 | None | ||
Mixtral-8x22b-Instruct-v0.1 | 141B | 80GB+ VRAM GPU (1xH100 or better) | 61.4 | None | ||
Command R+ | 104B | 60GB+ VRAM GPU (3xRTX 4090 or better) | 55.9 | |||
DBRX-Instruct | 132B | 80GB+ VRAM GPU (1xH100 or better) | 49.6 | None |
Mid-sized models: Suitable for deployment on a high-performance local workstation. These models require a high-end consumer configuration with a powerful GPU, which can cost from 2,000 to 3,400 ($/£/€ equivalent).
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Qwen2.5-32B-Instruct | 32.8B | 20GB+ VRAM GPU (RX 7900 XT or RTX 4090 or better) | (~62) | |||
Mistral-Small-Instruct | 22.2B | 13GB+ VRAM GPU (RX 7800 or RTX 4080 or better) | 60.40 | |||
Command R | 35B | 20GB+ VRAM GPU (RX 7900 XT or RTX 4090 or better) | 51.1 | |||
internlm2_5-20b-chat | 20B | 12GB+ VRAM GPU (RX 7800 or RTX 4070 or better) | (~49) | None | ||
Gemma-2-27b-it | 27B | 16GB+ VRAM GPU (RX 7800 or RTX 4080 or better) | 48.55 | |||
Mixtral-8x7b-Instruct-v0.1 | 46.7B | 26GB+ VRAM GPU (1xH100 or better) | 41.9 |
Small models: Lightweight and easily deployable on most local machines. These models require mid-range consumer configurations with a GPU, ranging from 600 to 1,200 ($/£/€ equivalent).
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Qwen2.5-14B-Instruct | 14.8B | 9GB+ VRAM GPU (rx 7800 or RTX 4070 or better) | (~58) | |||
Ministral-8B-Instruct | 8B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 53.30 | No | ||
Llama-3.2-11b-Vision-Instruct | 10.7B | 8GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 53.30 | No | ||
Llama-3.1-8b-Instruct | 8B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 53.15 | |||
Qwen2.5-7B-Instruct | 7.62B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | (~48) | |||
Gemma-2-9b-it | 9B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 46.650 | |||
Llama-3-8b-Instruct | 8B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 46.1 | |||
internlm2_5-7b-chat | 7B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | (~44) | None | ||
Phi-3-medium-128k-instruct | 14B | 8GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | None | None |
Tiny models: The smallest models are designed to run on all types of machines, including the oldest ones. These models can be run on most consumer hardware configurations, provided they have at least 6-8 GB of RAM.
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Llama-3.2-3b-Instruct | 3B | 4GB+ RAM | 46.7 | |||
Qwen2.5-3B-Instruct | 3B | 4GB+ RAM | (~41) | |||
SmolLM2-1.7B-Instruct | 1.7B | 2GB+ RAM | (~28) | |||
Gemma-2-2b-it | 2B | 2GB+ RAM | 30 | |||
Llama-3.2-1B-Instruct | 1B | 2GB+ RAM | 27.1 | |||
Internlm2_5-1_8b-chat | 1.8B | 2GB+ RAM | (~26) | None | ||
Qwen2.5-1.5B-Instruct | 1.5B | 2GB+ RAM | (~25) | |||
SmolLM2-360M-Instruct | 360M | 1GB+ RAM | (~20) | |||
Qwen2.5-0.5B-Instruct | 0.5B | 1GB+ RAM | (~18) | |||
SmolLM2-135M-Instruct | 135M | 0.5GB+ RAM | (~15) |
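Once Ollama is installed (see the setup section below), trying one of these tiny models takes a single command. The tags shown here exist in the Ollama library at the time of writing, though model names and default tags may change:

```
# Pull and chat with a 3B model (about a 2 GB download at Q4_0)
ollama run llama3.2

# Or try an even smaller model on very constrained machines
ollama run smollm2:135m
```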
Proprietary models: These models typically run on cloud-based providers rather than locally.
Model | Quality Index | Pricing |
---|---|---|
o1-preview | 84.6 | ||
o1-mini | 81.6 | ||
Claude-3.5 Sonnet | 80 | ||
Gemini 1.5 Pro | 79.7 | ||
GPT-4o-latest | 77.2 | ||
Yi-Lightning | None (~76.5) | ||
Qwen-Max | 75 | ||
GPT-4 Turbo | 74.3 | ||
GPT-4o-mini | 71.4 | ||
Claude-3 Opus | 70.3 | ||
Gemini 1.5 Flash | 68.0 | ||
Yi-Large | 58.3 | ||
Claude-3 Sonnet | 57.2 | ||
Reka-Core | 56.8 | ||
Claude-3 Haiku | 54.2 | ||
Reka-Flash | 46.2 |
Ollama is our top recommendation for running LLMs locally due to its robust integration capabilities and adaptability.
As an example, you will find below a step-by-step guide on setting up Ollama as a local model provider through an accessible and user-friendly interface.
Please follow the official documentation if you wish to use another provider, or open an issue on this repository if you would like a dedicated section for your preferred provider in this file.
Platform | Installation Method |
---|---|
macOS | Download |
Windows | Download |
Linux | Manual install instructions |
Docker | Ollama Docker image is available on Docker Hub. |
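For the Docker route, the image is typically started along these lines (a minimal sketch based on the Docker Hub instructions; check that page for GPU-specific flags and the latest options):

```
# Start the Ollama server in a container (CPU-only; GPU setups need extra flags,
# e.g. --gpus=all with the NVIDIA container toolkit)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Run a model inside the container
docker exec -it ollama ollama run llama3.1
```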
To run and chat with Llama 3.1, enter the following command in a terminal:
ollama run llama3.1
This will start a chat with the llama3.1:8b model (the default 8B variant) directly in the command-line interface (CLI). See the list of available models on ollama.com/library.
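Each entry on ollama.com/library also lists the available tags; appending one lets you pick a specific size or quantization. For example, assuming your hardware meets the requirements in the tables above:

```
# Run the 70B variant of Llama 3.1 instead of the default 8B
ollama run llama3.1:70b
```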
To download a model without launching it, simply enter the following command:
ollama pull llama3.1
To view the list of models you've downloaded, use the following command:
ollama list
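Ollama also exposes a local REST API (on port 11434 by default), which is what browser integrations such as Page Assist connect to. A minimal request looks roughly like this; see the Ollama API documentation for the full set of endpoints and options:

```
# Ask the local API for a completion (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain in one sentence why running an LLM locally helps privacy.",
  "stream": false
}'
```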
For those who prefer a more user-friendly experience, we'll demonstrate how to interact with your Ollama model through a browser-based interface, which provides a graphical and intuitive way of working with your LLM. We will be using the Page Assist extension.
Page Assist is an open-source Chrome Extension that provides a Sidebar and Web UI for your Local AI model. It allows you to interact with your model from any webpage.
Want to explore other possibilities? Take a look at the alternative solutions available in our Local Providers section.
You can install the extension from the Chrome Web Store.
Browser | Sidebar | Chat With Webpage | Web UI |
---|---|---|---|
Chrome | ✅ | ✅ | ✅ |
Brave | ✅ | ✅ | ✅ |
Firefox | ✅ | ✅ | ✅ |
Vivaldi | ✅ | ✅ | ✅ |
Edge | ✅ | ❌ | ✅ |
Opera | ❌ | ❌ | ✅ |
Arc | ❌ | ❌ | ✅ |
If needed, see the Manual Installation instructions on their GitHub repository.
Once the extension is installed, just click on the extension icon and it will take you straight to a ChatGPT-like UI.
To verify that everything works after installation, we suggest the following checks:
- A message in the center of the screen should confirm that Ollama is running in the background, ready to handle your requests (you can also confirm this from a terminal, as shown after this list).
- In the top-left corner, a dropdown menu lists all the models you have installed and that are currently available. Simply select the model you wish to use.
- The icon in the top left opens a sidebar for conversation management, which lets you organize and keep track of your conversations.
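To double-check from outside the browser that the Ollama server is up, you can query its local endpoint from a terminal:

```
# A running Ollama server answers with "Ollama is running"
curl http://localhost:11434
```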
Note
When you first interact with your model, there might be a brief delay while it loads into memory. Once it's loaded, responses should come quickly! Just remember that processing time varies with your computer's specs.
Once the extension is installed, you can open the sidebar via the context menu or a keyboard shortcut. From the sidebar, you can chat with your model while using the current web page as contextual reference (website, documentation, PDF...).
Note
To use the chat with the current page option, you need to set an Embedding Model in the RAG Settings.
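For example (our suggestion rather than an official requirement), you can pull a lightweight embedding model from the Ollama library and then select it in the Page Assist RAG Settings:

```
# Download a small embedding model used for retrieving page/document content
ollama pull nomic-embed-text
```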
Default Keyboard Shortcut: Ctrl+Shift+P
You can also open the Web UI by clicking on the extension icon, which opens it in a new tab.
Default Keyboard Shortcut: Ctrl+Shift+L
Note
You can change the keyboard shortcuts from the extension settings on the Chrome Extension Management page.