Run LLMs Locally on Your Machine

👋 Ready to explore the world of local AI? This guide will help you set up and run language models right on your computer, giving you full control over your AI tools. 👋

Table of Contents

  • Introduction
  • The Privacy and Security Issues of Cloud-Based Providers
  • Benefits of Running LLMs Locally
  • Find the Provider that is right for you
  • Find the Model that is right for you
  • Ollama
  • User-Friendly Ollama Model Interaction

Introduction

This tutorial is designed for individuals seeking greater control and transparency in their data processing, regardless of their background or expertise. We will provide a step-by-step guide on how to set up a local LLM environment using Ollama as the backend and the Page Assist extension in your browser.

The Privacy and Security Issues of Cloud-Based Providers

The use of Large Language Model (LLM) services online presents significant privacy concerns:

  • Data Storage and Processing by Third Parties: Your data is stored and processed by third-party providers, which can result in unintended consequences such as:

    • Sharing your input data with other users.
    • Using your data for purposes beyond your initial intent.
  • Algorithm Complexity: The complexity of the algorithms used to train these models poses challenges:

    • Opacity: Algorithms are often opaque, making it difficult to understand how your data is being processed.
    • Bias and Discrimination: This lack of transparency can lead to biased or discriminatory outcomes.
  • Scale and Data Breaches: The large scale of cloud-based LLMs means that even minor issues can result in:

    • Massive data breaches.
    • Compromised privacy and security for countless users.

Benefits of Running LLMs Locally

Running Large Language Models (LLMs) locally offers several compelling benefits:

  • Complete Control Over Input Data: Maintain complete control over your input data and ensure its confidentiality – no sensitive information will be shared with third-party providers.

  • Faster Processing Times and Reduced Latency: Enjoy faster processing times and reduced latency, making it well-suited for applications where real-time feedback is crucial.

  • Avoidance of Data Sovereignty Issues: Keeping your data and model on-premises helps you avoid potential issues related to:

    • Compliance with data localization requirements.
    • Regulatory standards.

Ultimately, running LLMs locally provides a high degree of privacy and control, allowing you to tailor the model's configuration and deployment to meet your specific goals and objectives.


Find the Provider that is right for you

To get started, select a local model provider that streamlines the process of hosting and deploying an LLM.

This section is designed to provide you with the necessary knowledge and resources to make informed decisions about selecting the right provider for your specific needs.

| Provider | User Interface | Ease of use | Model customization | Built-in chat interface | Model discovery/download | Multi-platform support | Open-source | Integration with other tools |
|---|---|---|---|---|---|---|---|---|
| Ollama | Command-line & API | Simple | Yes | No (requires client) | Limited | Windows, macOS, Linux | Yes | Extensive |
| LM Studio | GUI | User-friendly | Yes | Yes | Extensive | Windows, macOS, Linux | No | Limited |
| Jan | GUI | User-friendly | Limited | Yes | Limited | Windows, macOS, Linux | Yes | Limited |
| MSTY | GUI | User-friendly | Limited | Yes | Limited | macOS | No | Limited |
| Enchanted | GUI | User-friendly | Limited | Yes | Limited | macOS | No | Yes (macOS) |
| AnythingLLM | Web-based | User-friendly | Yes | Yes | Limited | Cross-platform (web-based) | Yes | Yes |

Find the Model that is right for you

Once you have selected your local model provider, you'll need to decide which model to run. For users with limited system resources or older hardware configurations, I will also list cloud-based providers that can run these models efficiently.

This section is designed to provide you with the necessary knowledge and resources to make informed decisions about selecting the right model for your specific needs. To facilitate this process, I have prepared two reference tables to support your search.

  • The first table showcases open-source models, which can be run locally on your machine. To ensure optimal performance, I have outlined the recommended hardware requirements for each model.
  • The second table features proprietary models, which typically operate on cloud-based providers.

Important

The VRAM requirements listed in the tables are indicative estimates, calculated for Q4_0 quantization, which balances model precision against inference speed and corresponds to the default Ollama configuration.

Please note that while it may be possible to run certain models on lower hardware specifications, this may result in slower inference. If the model does not fit entirely within VRAM, part of it must be offloaded to system memory, which is significantly slower and can greatly impact inference speed.
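
As a rough sanity check, you can estimate the weight footprint yourself: Q4_0 stores roughly 4.5 bits per weight, so a model's weights take about (parameters in billions) × 4.5 / 8 gigabytes. The terminal sketch below uses that approximation (the 4.5-bit figure and the overhead margin are rules of thumb, not Ollama's exact accounting):

# ~4.5 bits per weight at Q4_0: 70B parameters -> about 39 GB of weights,
# which is why the tables list "40GB+ VRAM" for 70B models.
echo "scale=1; 70 * 4.5 / 8" | bc
# Leave an extra 10-20% of headroom for the KV cache and runtime overhead.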

Note

The models are ranked according to their Quality Index (with higher scores indicating better performance) from the Artificial Analysis LLM Leaderboard. Please note that the Quality Index is subject to change based on daily test runs and will be updated regularly to reflect the latest rankings.

We consider this benchmarking methodology to be less biased than the Elo score method employed by LMSys. Furthermore, the LMSys leaderboard does not address dataset contamination, model quantization, or model overfitting issues.

Tip

🚩 Badges provide direct links to model downloads, provider services, and official documentation. 🚩

Open Source Models

Massive models: Local deployment can be challenging due to high computational requirements. These models are commonly used on cloud-based provider platforms.

| Organization | Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
|---|---|---|---|---|---|---|
| Alibaba | Qwen2.5-72B-Instruct | 72.2B | 47GB+ VRAM GPU (2xRTX 4090 or better) | 75.2 | Ollama | Hugging Face |
| Mistral | Mistral-Large-2-Instruct | 123B | 70GB+ VRAM GPU (3xRTX 4090 or better) | 73 | Ollama | Mistral |
| Meta | Llama-3.1-405b-Instruct | 405B | 230GB+ VRAM GPU (4xH100 or better) | 71.9 | Ollama | OpenRouter |
| Nvidia | Llama-3.1-Nemotron-70B-Instruct-HF | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 69.9 | Ollama | Nvidia |
| Alibaba | Qwen2-72B-Instruct | 72B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 68.9 | Ollama | Hugging Face |
| Deepseek | Deepseek-v2.5 | 236B | 133GB+ VRAM GPU (2xH100 or better) | 65.8 | Ollama | Deepseek |
| Meta | Llama-3.2-90B-Vision-Instruct | 88.6B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 65.5 | Ollama | Fireworks |
| Meta | Llama-3.1-70b-Instruct | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 65.3 | Ollama | Groq |
| Meta | Llama-3-70b-Instruct | 70B | 40GB+ VRAM GPU (2xRTX 4090 or better) | 61.9 | Ollama | None |
| Mistral | Mixtral-8x22b-Instruct-v0.1 | 141B | 80GB+ VRAM GPU (1xH100 or better) | 61.4 | Ollama | None |
| cohere | Command R+ | 104B | 60GB+ VRAM GPU (3xRTX 4090 or better) | 55.9 | Ollama | cohere |
| databricks | DBRX-Instruct | 132B | 80GB+ VRAM GPU (1xH100 or better) | 49.6 | Ollama | None |

Mid-sized models: Suitable for deployment on a high-performance local workstation. These models require high-end consumer configurations with a powerful GPU, which can range from 2,000 to 3,400 ($/£/€ equivalent).

| Organization | Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
|---|---|---|---|---|---|---|
| Alibaba | Qwen2.5-32B-Instruct | 32.8B | 20GB+ VRAM GPU (RX 7900 XT or RTX 4090 or better) | (~62) | Ollama | Hugging Face |
| Mistral | Mistral-Small-Instruct | 22.2B | 13GB+ VRAM GPU (RX 7800 or RTX 4080 or better) | 60.40 | Ollama | Mistral |
| cohere | Command R | 35B | 20GB+ VRAM GPU (RX 7900 XT or RTX 4090 or better) | 51.1 | Ollama | cohere |
| InternLM | internlm2_5-20b-chat | 20B | 12GB+ VRAM GPU (RX 7800 or RTX 4070 or better) | (~49) | Ollama | None |
| Google | Gemma-2-27b-it | 27B | 16GB+ VRAM GPU (RX 7800 or RTX 4080 or better) | 48.55 | Ollama | Hugging Face |
| Mistral | Mixtral-8x7b-Instruct-v0.1 | 46.7B | 26GB+ VRAM GPU (1xH100 or better) | 41.9 | Ollama | Hugging Face |

Small models: Lightweight and easily deployable on most local machines. These models require mid-range consumer configurations with a GPU, ranging from 600 to 1,200 ($/£/€ equivalent).

| Organization | Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
|---|---|---|---|---|---|---|
| Alibaba | Qwen2.5-14B-Instruct | 14.8B | 9GB+ VRAM GPU (RX 7800 or RTX 4070 or better) | (~58) | Ollama | Hugging Face |
| Mistral | Ministral-8B-Instruct | 8B | 6GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | 53.30 | No | Mistral |
| Meta | Llama-3.2-11b-Vision-Instruct | 10.7B | 8GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | 53.30 | No | Hugging Face |
| Meta | Llama-3.1-8b-Instruct | 8B | 6GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | 53.15 | Ollama | Groq |
| Alibaba | Qwen2.5-7B-Instruct | 7.62B | 6GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | (~48) | Ollama | Hugging Face |
| Google | Gemma-2-9b-it | 9B | 6GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | 46.65 | Ollama | Hugging Face |
| Meta | Llama-3-8b-Instruct | 8B | 6GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | 46.1 | Ollama | Perplexity |
| InternLM | internlm2_5-7b-chat | 7B | 6GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | (~44) | Ollama | None |
| microsoft | Phi-3-medium-128k-instruct | 14B | 8GB+ VRAM GPU (RX 7600 or RTX 4060 or better) | None | Ollama | None |

Tiny models : The smallest models are designed to run on all types of machines, including the oldest ones. These models can be run on most consumer hardware configurations, provided they have at least 6-8 GB of RAM.

| Organization | Model | Model Size | Hardware requirement | Quality Index | Ollama library | Cloud-based providers |
|---|---|---|---|---|---|---|
| Meta | Llama-3.2-3b-Instruct | 3B | 4GB+ RAM | 46.7 | Ollama | Groq |
| Alibaba | Qwen2.5-3B-Instruct | 3B | 4GB+ RAM | (~41) | Ollama | Hugging Face |
| Hugging Face | SmolLM2-1.7B-Instruct | 1.7B | 2GB+ RAM | (~28) | Ollama | Hugging Face |
| Google | Gemma-2-2b-it | 2B | 2GB+ RAM | 30 | Ollama | Hugging Face |
| Meta | Llama-3.2-1B-Instruct | 1B | 2GB+ RAM | 27.1 | Ollama | Groq |
| InternLM | Internlm2_5-1_8b-chat | 1.8B | 2GB+ RAM | (~26) | Ollama | None |
| Alibaba | Qwen2.5-1.5B-Instruct | 1.5B | 2GB+ RAM | (~25) | Ollama | Hugging Face |
| Hugging Face | SmolLM2-360M-Instruct | 360M | 1GB+ RAM | (~20) | Ollama | Hugging Face |
| Alibaba | Qwen2.5-0.5B-Instruct | 0.5B | 1GB+ RAM | (~18) | Ollama | Hugging Face |
| Hugging Face | SmolLM2-135M-Instruct | 135M | 0.5GB+ RAM | (~15) | Ollama | Hugging Face |

Proprietary Models

| Provider | Model | Quality Index | Pricing |
|---|---|---|---|
| OpenAI | o1-preview | 84.6 | ✴️ |
| OpenAI | o1-mini | 81.6 | ✴️ |
| Anthropic | Claude-3.5 Sonnet | 80 | ✴️ |
| Google | Gemini 1.5 Pro | 79.7 | ✴️ |
| OpenAI | GPT-4o-latest | 77.2 | ✴️ |
| 01AI | Yi-Lightning | None (~76.5) | Freemium |
| Alibaba | Qwen-Max | 75 | Freemium |
| OpenAI | GPT-4 Turbo | 74.3 | ✴️ |
| OpenAI | GPT-4o-mini | 71.4 | Freemium |
| Anthropic | Claude-3 Opus | 70.3 | ✴️ |
| Google | Gemini 1.5 Flash | 68.0 | ✴️ |
| 01AI | Yi-Large | 58.3 | Freemium |
| Anthropic | Claude-3 Sonnet | 57.2 | Freemium |
| Reka | Reka-Core | 56.8 | Freemium |
| Anthropic | Claude-3 Haiku | 54.2 | Freemium |
| Reka | Reka-Flash | 46.2 | Freemium |

Ollama

Ollama is our top recommendation for running LLMs locally due to its robust integration capabilities and adaptability.

As an example, below is a step-by-step guide to setting up Ollama as a local model provider behind an accessible, user-friendly interface.

Please follow the official documentation if you wish to use another provider, or open an issue on this repository if you would like a dedicated section for your preferred provider in this file.

Installation

| Platform | Installation Method |
|---|---|
| macOS | Download |
| Windows | Download |
| Linux | Manual install instructions |
| Docker | Ollama Docker image is available on Docker Hub. |
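
For convenience, both the Linux installer and the Docker image can be set up directly from a terminal. The commands below mirror the official Ollama documentation; check it for the most up-to-date instructions and GPU-specific flags:

# Linux: one-line install script
curl -fsSL https://ollama.com/install.sh | sh

# Docker: CPU-only container exposing the default API port (11434)
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama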

Quickstart


To run and chat with Llama 3.1, enter the following command in a terminal:

ollama run llama3.1

This will allow you to chat with the llama3.1:8B model within the command-line interface (CLI). See the list of models available on ollama.com/library.

To download a model without launching it, simply enter the following command:

ollama pull llama3.1

To view the list of models you've downloaded, simply use the following command:

ollama list
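
Ollama also exposes a local REST API (on port 11434 by default), which is what graphical front ends such as Page Assist talk to. As a minimal sketch, assuming llama3.1 has already been pulled, you can send a prompt with curl:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?"
}'

The response is streamed back as JSON lines; add "stream": false to the request body if you prefer a single JSON object.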

User-Friendly Ollama Model Interaction

For those who prefer a more user-friendly experience, we'll demonstrate how to interact with your Ollama model through a browser-based interface, which provides a graphical and intuitive way of working with your LLM. We will be using the Page Assist extension.

Page Assist is an open-source Chrome Extension that provides a Sidebar and Web UI for your Local AI model. It allows you to interact with your model from any webpage.

Want to explore other possibilities? Take a look at the alternative solutions available in our Local Providers section.

Installation and setup

You can install the extension from the Chrome Web Store

Browser Support

The extension currently works with Chrome, Brave, Firefox, Vivaldi, Edge, Opera, and Arc. See the Page Assist repository for the up-to-date support matrix across its Sidebar, Chat With Webpage, and Web UI features.

If needed, see the Manual Installation instructions on their GitHub repository.

Once the extension is installed, just click on the extension icon and it will take you straight to a ChatGPT-like UI.

To verify the installation, I would suggest checking the following:

  • A message in the center of the page should let you know that Ollama is running in the background, ready to handle your requests.
  • In the top-left corner, a dropdown menu lists all the models you've installed that are currently available for interaction. Simply select the model you wish to engage with.
  • Clicking the top-left icon opens a sidebar for Conversation Management, which allows you to manage and organize your conversations.

Note

When you first interact with your model, there might be a brief delay as it loads into memory. But once you're chatting away, responses should come quickly! Just remember that processing time can vary depending on your computer's specs.
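
If you prefer to check from a terminal which model is currently loaded in memory (and how much of it sits on the GPU versus the CPU), recent versions of Ollama provide the following command:

ollama ps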

Usage

Sidebar

Once the extension is installed, you can open the sidebar via the context menu or a keyboard shortcut. This lets you engage in seamless conversations with your model while leveraging the current web page as contextual reference (website, documentation, PDF, etc.).

▶️ In order to use the chat with the current page option, you need to set an embedding model in the RAG Settings, as shown below.
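
For example, you could pull an embedding model from the Ollama library and then select it in Page Assist's RAG Settings; nomic-embed-text is used here purely as an illustration, and any embedding model available in the library should work:

ollama pull nomic-embed-text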

Default Keyboard Shortcut: Ctrl+Shift+P

Web UI

Clicking on the extension icon opens the Web UI in a new tab.

Default Keyboard Shortcut: Ctrl+Shift+L

Note

You can change the keyboard shortcuts from the extension settings on the Chrome Extension Management page.