Ready to explore the world of local AI? This guide will help you set up and run language models right on your computer, giving you full control over your AI tools.
This tutorial is for anyone seeking greater control and transparency over how their data is processed, regardless of background or expertise. It walks you through setting up a local LLM environment, step by step, using Ollama as the backend and the Page Assist extension in your browser as the frontend.
The use of Large Language Model (LLM) services online presents significant privacy concerns:
- Data Storage and Processing by Third Parties: Your data is stored and processed by third-party providers, which can result in unintended consequences such as:
  - Sharing your input data with other users.
  - Using your data for purposes beyond your initial intent.
- Algorithm Complexity: The complexity of the algorithms used to train these models poses challenges:
  - Opacity: Algorithms are often opaque, making it difficult to understand how your data is being processed.
  - Bias and Discrimination: This lack of transparency can lead to biased or discriminatory outcomes.
- Scale and Data Breaches: The large scale of cloud-based LLMs means that even minor issues can result in:
  - Massive data breaches.
  - Compromising the privacy and security of countless users.
Running Large Language Models (LLMs) locally offers several compelling benefits:
- Complete Control Over Input Data: Maintain complete control over your input data and ensure its confidentiality – no sensitive information is shared with third-party providers.
- Faster Processing Times and Reduced Latency: Enjoy faster processing times and reduced latency, making local models well suited for applications where real-time feedback is crucial.
- Avoidance of Data Sovereignty Issues: Keeping your data and model on-premises helps you avoid potential issues related to:
  - Compliance with data localization requirements.
  - Regulatory standards.
Ultimately, running LLM models locally provides a high degree of privacy and control, allowing you to tailor the model's training and deployment to meet your specific goals and objectives.
To get started, select a local model provider that streamlines hosting and deploying an LLM.
This section gives you the knowledge you need to choose the right provider for your specific needs.
Provider | User Interface | Ease of use | Model customization | Built-in chat interface | Model discovery/download | Multi-platform support | Open-source | Integration with other tools |
---|---|---|---|---|---|---|---|---|
Ollama | Command-line & API | Simple | Yes | No (requires client) | Limited | Windows, macOS, Linux | Yes | Extensive |
LM Studio | GUI | User-friendly | Yes | Yes | Extensive | Windows, macOS, Linux | No | Limited |
Jan | GUI | User-friendly | Limited | Yes | Limited | Windows, macOS, Linux | Yes | Limited |
MSTY | GUI | User-friendly | Limited | Yes | Limited | macOS | No | Limited |
Enchanted | GUI | User-friendly | Limited | Yes | Limited | macOS | No | Yes (macOS) |
AnythingLLM | Web-based | User-friendly | Yes | Yes | Limited | Cross-platform (web-based) | Yes | Yes |
Once you have selected your local model provider, you'll need to decide which model to run. For users with limited system resources or older hardware, we also list cloud-based providers that can run these models for you.
This section gives you the information you need to choose the right model for your specific needs. To help with this, we have prepared two reference tables.
- The first table showcases open-source models, which can be run locally on your machine. To ensure optimal performance, we have outlined the recommended hardware requirements for each model.
- The second table features proprietary models, which typically operate on cloud-based providers.
Important
The VRAM requirements listed in the tables are indicative estimates, calculated for Q4_0 quantization (Ollama's default), which balances model precision and inference speed.
While it may be possible to run certain models on lower hardware specifications, inference will be slower: if the model does not fit entirely in VRAM, part of it is offloaded to system memory, which is significantly slower and can greatly reduce inference speed. See the rough estimate below.
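As a rough rule of thumb (our own approximation derived from the Q4_0 figures in the tables below, not an exact formula), you can budget about half a gigabyte of VRAM per billion parameters, plus a couple of gigabytes of overhead for the context window and runtime:

```
# Rough Q4_0 VRAM estimate: ~0.5 GB per billion parameters + ~2 GB overhead.
# Example: for an 8B model this prints "roughly 6 GB", matching the tables below.
PARAMS_B=8
echo "A ${PARAMS_B}B model at Q4_0 needs roughly $((PARAMS_B / 2 + 2)) GB of VRAM"
```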
Note
The models are ranked by their Quality Index (higher scores indicate better performance) from the Artificial Analysis LLM Leaderboard. Note that the Quality Index changes with the daily test runs, and these tables will be updated regularly to reflect the latest rankings.
We consider this benchmarking methodology less biased than the Elo score method employed by LMSys. Furthermore, the LMSys leaderboard does not address dataset contamination, model quantization, or model overfitting.
Massive models: Local deployment can be challenging due to high computational requirements. These models are commonly used on cloud-based provider platforms.
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Qwen2.5-72B-Instruct | 72.2B | 47GB+ VRAM GPU (2xRTX 4090 or better) | 75.2 | |||
Mistral-Large-2-Instruct | 123B | 70GB+ VRAM GPU (3xRTX 4090 or better) | 73 | |||
Llama-3.1-405b-Instruct | 405B | 230GB+ VRAM GPU (4xH100 or better) | 71.9 | |||
Llama-3.1-Nemotron-70B-Instruct-HF | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 69.9 | |||
Qwen2-72B-Instruct | 72B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 68.9 | |||
Deepseek-v2.5 | 236B | 133GB+ VRAM GPU (2xH100 or better) | 65.8 | |||
Llama-3.2-90B-Vision-Instruct | 88.6B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 65.5 | |||
Llama-3.1-70b-Instruct | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 65.3 | |||
Llama-3-70b-Instruct | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 61.9 | None | ||
Mixtral-8x22b-Instruct-v0.1 | 141B | 80GB+ VRAM GPU (1xH100 or better) | 61.4 | None | ||
Command R+ | 104B | 60GB+ VRAM GPU (3xRTX 4090 or better) | 55.9 | |||
DBRX-Instruct | 132B | 80GB+ VRAM GPU (1xH100 or better) | 49.6 | None |
Mid-sized models: Suitable for deployment on a high-performance local workstation. These models require a high-end consumer configuration with a powerful GPU, which can cost from 2,000 to 3,400 ($/£/€ equivalent).
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Qwen2.5-32B-Instruct | 32.8B | 20GB+ VRAM GPU (RX 7900 XT or RTX 4090 or better) | (~62) | |||
Mistral-Small-Instruct | 22.2B | 13GB+ VRAM GPU (RX 7800 or RTX 4080 or better) | 60.40 | |||
Command R | 35B | 20GB+ VRAM GPU (RX 7900 XT or RTX 4090 or better) | 51.1 | |||
internlm2_5-20b-chat | 20B | 12GB+ VRAM GPU (RX 7800 or RTX 4070 or better) | (~49) | None | ||
Gemma-2-27b-it | 27B | 16GB+ VRAM GPU (RX 7800 or RTX 4080 or better) | 48.55 | |||
Mixtral-8x7b-Instruct-v0.1 | 46.7B | 26GB+ VRAM GPU (1xH100 or better) | 41.9 |
Small models: Lightweight and easily deployable on most local machines. These models require mid-range consumer configurations with a GPU, ranging from 600 to 1,200 ($/£/€ equivalent).
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Qwen2.5-14B-Instruct | 14.8B | 9GB+ VRAM GPU (rx 7800 or RTX 4070 or better) | (~58) | |||
Ministral-8B-Instruct | 8B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 53.30 | No | ||
Llama-3.2-11b-Vision-Instruct | 10.7B | 8GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 53.30 | No | ||
Llama-3.1-8b-Instruct | 8B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 53.15 | |||
Qwen2.5-7B-Instruct | 7.62B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | (~48) | |||
Gemma-2-9b-it | 9B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 46.650 | |||
Llama-3-8b-Instruct | 8B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | 46.1 | |||
internlm2_5-7b-chat | 7B | 6GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | (~44) | None | ||
Phi-3-medium-128k-instruct | 14B | 8GB+ VRAM GPU (rx 7600 or RTX 4060 or better) | None | None |
Tiny models: The smallest models are designed to run on all types of machines, including the oldest ones. These models can be run on most consumer hardware configurations, provided they have at least 6-8 GB of RAM.
Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
---|---|---|---|---|---|
Llama-3.2-3b-Instruct | 3B | 4GB+ RAM | 46.7 | |||
Qwen2.5-3B-Instruct | 3B | 4GB+ RAM | (~41) | |||
SmolLM2-1.7B-Instruct | 1.7B | 2GB+ RAM | (~28) | |||
Gemma-2-2b-it | 2B | 2GB+ RAM | 30 | |||
Llama-3.2-1B-Instruct | 1B | 2GB+ RAM | 27.1 | |||
Internlm2_5-1_8b-chat | 1.8B | 2GB+ RAM | (~26) | None | ||
Qwen2.5-1.5B-Instruct | 1.5B | 2GB+ RAM | (~25) | |||
SmolLM2-360M-Instruct | 360M | 1GB+ RAM | (~20) | |||
Qwen2.5-0.5B-Instruct | 0.5B | 1GB+ RAM | (~18) | |||
SmolLM2-135M-Instruct | 135M | 0.5GB+ RAM | (~15) |
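Once Ollama is installed (see the setup section below), trying one of these tiny models takes a single command. The tags shown here exist in the Ollama library at the time of writing, though model names and default tags may change:

```
# Pull and chat with a 3B model (about a 2 GB download at Q4_0)
ollama run llama3.2

# Or try an even smaller model on very constrained machines
ollama run smollm2:135m
```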
Proprietary models: These models typically run on cloud-based providers rather than locally.
Model | Quality Index | Pricing |
---|---|---|
o1-preview | 84.6 | ||
o1-mini | 81.6 | ||
Claude-3.5 Sonnet | 80 | ||
Gemini 1.5 Pro | 79.7 | ||
GPT-4o-latest | 77.2 | ||
Yi-Lightning | None (~76.5) | ||
Qwen-Max | 75 | ||
GPT-4 Turbo | 74.3 | ||
GPT-4o-mini | 71.4 | ||
Claude-3 Opus | 70.3 | ||
Gemini 1.5 Flash | 68.0 | ||
Yi-Large | 58.3 | ||
Claude-3 Sonnet | 57.2 | ||
Reka-Core | 56.8 | ||
Claude-3 Haiku | 54.2 | ||
Reka-Flash | 46.2 |
Ollama is our top recommendation for running LLMs locally due to its robust integration capabilities and adaptability.
As an example, you will find below a step-by-step guide on setting up Ollama as a local model provider through an accessible and user-friendly interface.
Please follow the official documentation if you wish to use another provider, or open an issue on this repository if you would like a dedicated section for your preferred provider in this file.
Platform | Installation Method |
---|---|
macOS | Download |
Windows | Download |
Linux | Manual install instructions |
Docker | Ollama Docker image is available on Docker Hub. |
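For the Docker route, the image is typically started along these lines (a minimal sketch based on the Docker Hub instructions; check that page for GPU-specific flags and the latest options):

```
# Start the Ollama server in a container (CPU-only; GPU setups need extra flags,
# e.g. --gpus=all with the NVIDIA container toolkit)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Run a model inside the container
docker exec -it ollama ollama run llama3.1
```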
To run and chat with Llama 3.1, enter the following command in a terminal:
ollama run llama3.1
This will start a chat with the llama3.1:8b model (the default 8B variant) directly in the command-line interface (CLI). See the list of available models on ollama.com/library.
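Each entry on ollama.com/library also lists the available tags; appending one lets you pick a specific size or quantization. For example, assuming your hardware meets the requirements in the tables above:

```
# Run the 70B variant of Llama 3.1 instead of the default 8B
ollama run llama3.1:70b
```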
To download a model without launching it, simply enter the following command:
ollama pull llama3.1
To view the list of models you've downloaded, use the following command:
ollama list
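Ollama also exposes a local REST API (on port 11434 by default), which is what browser integrations such as Page Assist connect to. A minimal request looks roughly like this; see the Ollama API documentation for the full set of endpoints and options:

```
# Ask the local API for a completion (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain in one sentence why running an LLM locally helps privacy.",
  "stream": false
}'
```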
For those who prefer a more user-friendly experience, we'll demonstrate how to interact with your Ollama model through a browser-based interface, which provides a graphical and intuitive way of working with your LLM. We will be using the Page Assist extension.
Page Assist is an open-source Chrome Extension that provides a Sidebar and Web UI for your Local AI model. It allows you to interact with your model from any webpage.
Want to explore other possibilities? Take a look at the alternative solutions available in our Local Providers section.
You can install the extension from the Chrome Web Store.
Browser | Sidebar | Chat With Webpage | Web UI |
---|---|---|---|
Chrome | ✅ | ✅ | ✅ |
Brave | ✅ | ✅ | ✅ |
Firefox | ✅ | ✅ | ✅ |
Vivaldi | ✅ | ✅ | ✅ |
Edge | ✅ | ❌ | ✅ |
Opera | ❌ | ❌ | ✅ |
Arc | ❌ | ❌ | ✅ |
If needed, see the Manual Installation instructions on their GitHub repository.
Once the extension is installed, just click on the extension icon and it will take you straight to a ChatGPT-like UI.
To verify that everything works after installation, we suggest the following checks:
- A message in the center of the screen should confirm that Ollama is running in the background, ready to handle your requests (you can also confirm this from a terminal, as shown after this list).
- In the top-left corner, a dropdown menu lists all the models you have installed and that are currently available. Simply select the model you wish to use.
- The icon in the top left opens a sidebar for conversation management, which lets you organize and keep track of your conversations.
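To double-check from outside the browser that the Ollama server is up, you can query its local endpoint from a terminal:

```
# A running Ollama server answers with "Ollama is running"
curl http://localhost:11434
```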
Note
When you first interact with your model, there might be a brief delay while it loads into memory. Once it's loaded, responses should come quickly! Just remember that processing time varies with your computer's specs.
Once the extension is installed, you can open the sidebar via the context menu or a keyboard shortcut. From the sidebar, you can chat with your model while using the current web page as contextual reference (website, documentation, PDF...).
Note
To use the chat with the current page option, you need to set an Embedding Model in the RAG Settings.
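For example (our suggestion rather than an official requirement), you can pull a lightweight embedding model from the Ollama library and then select it in the Page Assist RAG Settings:

```
# Download a small embedding model used for retrieving page/document content
ollama pull nomic-embed-text
```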
Default Keyboard Shortcut: Ctrl+Shift+P
You can also open the Web UI by clicking on the extension icon, which opens it in a new tab.
Default Keyboard Shortcut: Ctrl+Shift+L
Note
You can change the keyboard shortcuts from the extension settings on the Chrome Extension Management page.