From 53053d37ba9ec92e92eef24a5ace5df1e9f8fe63 Mon Sep 17 00:00:00 2001 From: Orion Reblitz-Richardson Date: Sun, 14 Apr 2024 11:04:12 -0700 Subject: [PATCH] [Usability] Break README.md into sections and remove GETTING-STARTED.md (#176) * Partial revamp of README.md into sections and remove GETTING-STARTED.md. Include GGUF updates from https://github.com/pytorch/torchat/pull/176 --- GETTING-STARTED.md | 40 ------ README.md | 310 +++++++++++++++++++++------------------------ 2 files changed, 147 insertions(+), 203 deletions(-) delete mode 100644 GETTING-STARTED.md diff --git a/GETTING-STARTED.md b/GETTING-STARTED.md deleted file mode 100644 index 040a3f05f7..0000000000 --- a/GETTING-STARTED.md +++ /dev/null @@ -1,40 +0,0 @@ - -Instructions, as suggested by @Orion. (Consider creating a version -with text interspersed as Google Colab and link it here at the top.) - -``` -python3 -m pip install --user virtualenv -python3 -m virtualenv .llama-fast -source .llama-fast/bin/activate -git clone https://github.com/pytorch/torchat.git -cd llama-fast -git submodule sync -git submodule update --init - -# If we need PyTorch nightlies -pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu -# Otherwise -# pip3 install torch - -pip install sentencepiece huggingface_hub -# Eventually should be (when Dave has the PyPI packages) -# pip install sentencepiece huggingface_hub executorch -# I had some issues with the pytorch submodule not downloading from ExecuTorch - not sure why - -# To download Llama 2 models, go to https://huggingface.co/meta-llama/Llama-2-7b and go through steps to obtain access. - -# Once approved, login with -huggingface-cli login -# You will be asked for a token from https://huggingface.co/settings/tokens - -# Set the model and paths for stories15M as an example to test things on desktop and mobile -MODEL_NAME=stories15M -MODEL_PATH=checkpoints/${MODEL_NAME}/stories15M.pt -MODEL_DIR=~/llama-fast-exports - -# Could we make this stories15 instead? -export MODEL_DOWNLOAD=meta-llama/Llama-2-7b-chat-hf -./scripts/prepare.sh $MODEL_DOWNLOAD -python generate.py --compile --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --device {cuda,cpu,mps} - -``` \ No newline at end of file diff --git a/README.md b/README.md index b648f2269e..cb0cf8d81c 100644 --- a/README.md +++ b/README.md @@ -1,163 +1,152 @@ -# Preamble. +# Torchat is still in pre-release! -*The statements contained in this README are our northstar, and we will be reality-testing the statement, and remove any -items that are not factual. If you find an item that is incorrect, please tag as an issue, so we can triage and determine whether to fix, -or drop from our initial release.* -# torchat *NORTHSTAR* -A repo for building and using llama on servers, desktops and mobile with torchat (pronounced torch-chat). +Torchat is currently in a pre-release state and under extensive development. -The torchat repo enables model inference of llama models (and other LLMs) on servers, desktop and mobile devices. -For a list of devices, see below, under *SUPPORTED SYSTEMS*. -A goal of this repo, and the design of the PT2 components was to offer seamless integration and consistent workflows. -Both mobile and server/desktop paths start with `torch.export()` receiving the same model description. 
Similarly,
-integration into runners for Python (for initial testing) and Python-free environments (for deployment, in runner-aoti
-and runner-et, respectively) offer a consistent experience across backends and offer developers consistent interfaces
-and user experience whether they target server, desktop or mobile & edge use cases, and/or all of them.
+# Torchat
+[**Introduction**](#introduction) | [**Installation**](#installation) | [**Get Started**](#get-started) | [**Download**](#download) | [**Chat**](#chat) | [**Generate**](#generate) | [**Eval**](#eval) | [**Export**](#export) | [**Supported Systems**](#supported-systems) | [**Contributing**](#contributing) | [**License**](#license)

-# Simple and efficient pytorch-native transformer text generation.
+&nbsp;

-Featuring:
+## Introduction

-* Very low latency
-* <1000 lines of Python
-* No dependencies other than PyTorch and sentencepiece for server, and Executorch for mobile (plus, your mobile IDE, of course)
-* int8/int4 quantization (for linear and embedding operators)
-* Supports Nvidia and AMD GPUs, Apple GPUs with MPS, CPU (Linux/x86 and MacOS/ARM), and xnnpack, Vulkan and MPS for mobile GPUs,
-  and backend-specific mobile runtimes ("delegates", such as CoreML and Hexagon).
+Torchat (pronounced β€œtorch chat” and also a play on torch @ [laptop, desktop, mobile]) is a tool and library to easily run LLMs on laptops, desktops, and mobile devices using pure [PyTorch](https://github.com/pytorch/pytorch) and [ExecuTorch](https://github.com/pytorch/executorch). See below for a [full list of supported devices](#supported-systems).

-The model definition (and much more!) is adopted from gpt-fast, so we support the same models. As new models are supported by gpt-fast,
-bringing them into torchat should be straight forward. In addition, we invite community contributions
+The library provides:

-# Torchat usage
+- Command line interaction with popular LLMs through PyTorch eager and torch.compile
+- Export to laptop and desktop through AOT Inductor
+- Export to Android and iOS through [ExecuTorch](https://github.com/pytorch/executorch)
+- Very low latency through quantization and optimized kernels
+- Hackable PyTorch models and integration with [torchtune](https://github.com/pytorch/torchtune) for model fine-tuning
+- Import of GGUF models
+- <1000 lines of Python
+- Quantization to int8 and int4 for linear and embedding operators
+- Support for Nvidia and AMD GPUs, Apple GPUs with MPS, CPU (Linux/x86, MacOS/ARM), mobile CPUs with XNNPACK, mobile GPUs with Vulkan and CoreML, and hardware-specific delegates through CoreML and HTP

-torchat tools are either accessible through a common interface torchat, supporting chat, prompted text generation, model export,
-model test, and a standalone C/C++ runtime for server.
+While we strive to support a broad range of models, we can't test them all. We classify supported models as tested βœ…,
+work in progress 🚧, or subject to some restrictions ❹. As always, we invite community contributions of new model support and test results!
-| function | torchat call | direct command | tested | -|---|----|----|-----| -chat | `torchat --chat` | n/a | 🚧 | -generate text | `torchat --generate` |`generate` | βœ… | -evaluate model | `torchat --eval` | `eval` | 🚧 | -export model | `torchat --export` | `export` | βœ… | -exported model test (dso,pte) | `torchat --chat` | n/a | 🚧 | -exported model test (dso,pte) | `torchat --generate` |`generate` | βœ… | -evaluate exported model (dso,pte) | `torchat --eval` | `eval` | 🚧 | -server C++ runtime | n/a | run.cpp model.so | βœ… | -server C++ runtime | n/a | run.cpp model.pte | βœ… | -mobile C++ runtime | n/a | app model.pte | βœ… | -mobile C++ runtime | n/a | app + AOTI | 🚧 | - -Advantageously, exported models can be laoded back into torchat for chat or text generation, thus enabling experimentation with the exported model -as well as model quality validation, retaining the prior Python interface to simplify reuse of tests and test harnesses developed in Python during -model development and qualification. +| Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile | +|-----|--------|-------|-----|-----|-----|-----| +meta-llama/Llama-3-7b | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹ | +meta-llama/Llama-2-7b-chat-hf | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹| +meta-llama/Llama-2-13b-chat-hf | - | βœ… | βœ… | βœ… | βœ… | πŸ“΅ | +meta-llama/Llama-2-70b-chat-hf | - | βœ… | βœ… | βœ… | βœ… | ❌| +tinyllamas/stories15M | βœ… | βœ… | βœ… | βœ… | βœ… | βœ… | +tinyllamas/stories42M | - | βœ… | βœ… | βœ… | βœ… | βœ… | +tinyllamas/stories110M | βœ… | βœ… | βœ… | βœ… | βœ… | βœ… | +openlm-research/open_llama_7b | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹ | +codellama/CodeLlama-7b-Python-hf | -| βœ… | βœ… | βœ… | βœ… | ❹| +codellama/CodeLlama-34b-Python-hf | -| βœ… | βœ… | βœ… | βœ… | ❌ | +mistralai/Mistral-7B-v0.1 | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹ | +mistralai/Mistral-7B-Instruct-v0.1 | - | βœ… | βœ… | βœ… | βœ… | ❹ | +mistralai/Mistral-7B-Instruct-v0.2 | - | βœ… | βœ… | βœ… | βœ… | ❹ | -In addition, torchat comes with server C++ runtimes for both AOT Inductor-compiled models and Executorch-compiled models for servers. Mobile C++ runtimes -enable the deployment of Executorch-compiled PTE mobile/edge models on iOS, Android and Raspberry Pi 5. In addition, an experimental mobile C++ runtime -for AOT Inductor compiled models exists as prototype (as of 4/10). +*Key:* βœ… works correctly; 🚧 work in progress; ❌ not supported; ❹ requires 4bit groupwise quantization; πŸ“΅ not on mobile (may fit some high-end devices such as tablets); -`torchat --eval` (`eval` as direct command) give access to Eleuthera eval suite for eager models with/without torch.compile (optionally with user-defined quantization), DSO models and PTE models. +  -# Getting started +--- -Follow the `gpt-fast` [installation instructions](https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#installation). -Because torchat was designed to showcase the latest and greatest PyTorch 2 features for Llama (and related llama-style) models, many of the features used in torchat are hot off the press. [Download PyTorch nightly](https://pytorch.org/get-started/locally/) with the latest steaming hot PyTorch 2 features. 
+## Installation
+Currently `torchat` must be installed from source by cloning the repository and installing its dependencies:

-Install sentencepiece and huggingface_hub
-```bash
-pip install sentencepiece huggingface_hub
+```
+git clone https://github.com/pytorch/torchat.git
+cd torchat
+pip install -r requirements.txt
 ```

-If you are planning on using mobile backends, you will also [install ExecuTorch](https://pytorch.org/executorch/stable/getting-started-setup.html) and any hardware-specific libraries and IDEs.
-
-To download llama models, go to https://huggingface.co/meta-llama/Llama-2-7b and go through steps to obtain access.
-Then, login with `huggingface-cli login`
-
-## Downloading Weights
+To confirm that the package is installed correctly, you can run the following command:

-To download Llama 2 models, go to [https://huggingface.co/meta-llama/Llama-2-7b] and go through steps to obtain access.
-Once approved, login with
 ```
-huggingface-cli login
+torchat --help
 ```
-You will be asked for a token from [https://huggingface.co/settings/tokens].

-For example, to convert meta-llama/Llama-2-7b-chat-hf
-```bash
-export MODEL_DOWNLOAD=meta-llama/Llama-2-7b-chat-hf
-./scripts/prepare.sh $MODEL_DOWNLOAD
+You should see the following output:
+
 ```
+usage: torchat [-h] {chat,generate,eval,export} ...

-## Supported Models
+Welcome to the torchat CLI!

-While we strive to support a broad range of models, we can't test all models. Consequently, we classify supported models as tested βœ…,
-work in progress 🚧 and not tested. We invite community contributions of both new models, as well as test reports.
+options:
+  -h, --help  show this help message and exit

-Some common models are recognized by torchat based on their filename (we use the model constructor `Transformer.from_name()`). We derive this name from the last component of the pathname specified
-for the model, i.e., the name of the directory in which the model weights are specified and we perform a fuzzy match against a table of known model architectures.
-Alternatively, you can specify the index into that table with the option `--params-table ${INDEX}` where the index is the dictionary key in the `transformer_configs`
-dictionary specified [here](https://github.com/pytorch/torchat/blob/main/model.py#L85). For our running example with the stories15M model, this would be expressed as
-`--params-table stories15M`. (We use the model constructor `Transformer.from_table()`)
+...
+```

-For models not specified not in the list of "known configurations", you can construct the model by initializing the `ModelArgs` dataclass that controls model construction from a parameter json
-specified using the `params-path ${PARAMS_PATH}` containing the appropriate model parameters to initialize the ModelArgs for the model. (We use the model constructor `Transformer.from_params()`)
+If you are planning on using mobile backends, [install ExecuTorch](https://pytorch.org/executorch/stable/getting-started-setup.html) and any hardware-specific libraries and IDEs.

-The parameter file will should be in JSON format specifying thee parameters. You can find the Model Args data class in [`model.py`](https://github.com/pytorch/torchat/blob/main/model.py#L22).
+&nbsp;

-The final way to initialize a torchat model from a GGUF format, a new file format for storing models. You load a GGUF model with the option --gguf-path ${MODELNAME}.gguf`. Presently, the F16, F32, Q4_0, and Q6_K formats are supported and converted into native torch-chat models. Please refer to section *Loading GGUF* for details.
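
+If you prefer to keep the installation isolated from your system Python, the same steps can be run inside a virtual environment first. This is a minimal sketch based on the instructions previously in GETTING-STARTED.md; the environment name `.torchat` is arbitrary:
+
+```
+# Optional: create and activate a virtual environment before cloning and installing
+python3 -m pip install --user virtualenv
+python3 -m virtualenv .torchat
+source .torchat/bin/activate
+```
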
+---

-You may also dequantize GGUF models with the GGUF quantize tool, and then load and requantize with torchat native quantization options. (Please note that quantizing and dequantizing is a lossy process, and you will get the best results by starting with the original unquantized model checkpoint, not a previsoul;y quantized and thend equantized model.)
+## Get Started

-| Model | tested | eager | torch.compile | AOT Inductor | ET Runtime | Fits on Mobile |
-|-----|--------|-------|-----|-----|-----|-----|
-tinyllamas/stories15M | βœ… | βœ… | βœ… | βœ… | βœ… | βœ… |
-tinyllamas/stories42M | - | βœ… | βœ… | βœ… | βœ… | βœ… |
-tinyllamas/stories110M | βœ… | βœ… | βœ… | βœ… | βœ… | βœ… |
-openlm-research/open_llama_7b | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹ |
-meta-llama/Llama-2-7b-chat-hf | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹|
-meta-llama/Llama-2-13b-chat-hf | - | βœ… | βœ… | βœ… | βœ… | πŸ“΅ |
-meta-llama/Llama-2-70b-chat-hf | - | βœ… | βœ… | βœ… | βœ… | ❌|
-codellama/CodeLlama-7b-Python-hf | -| βœ… | βœ… | βœ… | βœ… | ❹|
-codellama/CodeLlama-34b-Python-hf | -| βœ… | βœ… | βœ… | βœ… | ❌ |
-mistralai/Mistral-7B-v0.1 | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹ |
-mistralai/Mistral-7B-Instruct-v0.1 | - | βœ… | βœ… | βœ… | βœ… | ❹ |
-mistralai/Mistral-7B-Instruct-v0.2 | - | βœ… | βœ… | βœ… | βœ… | ❹ |
-Llama3 | 🚧 | βœ… | βœ… | βœ… | βœ… | ❹ |
+Torchat lets you access LLMs through an interactive interface, prompted single-use generation, model export (for use by AOT Inductor and ExecuTorch), and standalone C++ runtimes.

-*Key:* βœ… works correctly; 🚧 work in progress; ❌ not supported; ❹ requires 4bit groupwise quantization; πŸ“΅ not on mobile phone (may fit some high-end devices such as tablets);
+| Function | Torchat Command | Direct Command | Tested |
+|---|----|----|-----|
+Download model | `torchat --download` | n/a | 🚧 |
+Interactive chat | `torchat --chat` | n/a | 🚧 |
+Generate text | `torchat --generate` | `generate` | βœ… |
+Evaluate model | `torchat --eval` | `eval` | 🚧 |
+Export model | `torchat --export` | `export` | βœ… |
+Exported model test (dso,pte) | `torchat --chat` | n/a | 🚧 |
+Exported model test (dso,pte) | `torchat --generate` | `generate` | βœ… |
+Evaluate exported model (dso,pte) | `torchat --eval` | `eval` | 🚧 |
+Server C++ runtime | n/a | run.cpp model.so | βœ… |
+Server C++ runtime | n/a | run.cpp model.pte | βœ… |
+Mobile C++ runtime | n/a | app model.pte | βœ… |
+Mobile C++ runtime | n/a | app + AOTI | 🚧 |

+Exported models can be loaded back into torchat for chat or text generation, letting you experiment with the exported model and validate model quality. The Python interface is the same in all cases and is used for testing and test harnesses too.

-### More downloading
+Torchat comes with server C++ runtimes to execute AOT Inductor and ExecuTorch models. Mobile C++ runtimes allow you to deploy ExecuTorch-compiled .pte files on iOS, Android and Raspberry Pi 5.

+## Download

-
-First cd into torchat. We first create a directory for stories15M and download the model and tokenizers.
-We show how to download @Andrej Karpathy's stories15M tiny llama-style model that were used in llama2.c. Advantageously,
-stories15M is both a great example and quick to download and run across a range of platforms, ideal for introductions like this
-README and for [testing](https://github.com/pytorch-labs/torchat/blob/main/.github/workflows). We will be using it throughout
-this introduction as our running example.
+For Llama 2 and 3, follow the instructions on the official [`meta-llama`](https://huggingface.co/meta-llama/Llama-2-7b) repository to ensure you have access to the Llama 2 model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.

 ```
-# Create directory for model and generated artifacts
-export MODEL_NAME=stories15M
-export MODEL_DIR=checkpoints/${MODEL_NAME}
-mkdir -p ${MODEL_DIR}
-# Output directory for exported models and tokenizers - same as checkpoints
-# or can use a separate directory
-export MODEL_OUT=${MODEL_DIR}
+huggingface-cli login
+torchat --download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf
+```
+
+Note: While the `torchat --download` command allows you to download *any* model from the hub, there's no guarantee that the model can be run with torchat. Currently supported models can be found [here](#introduction).

-# Download stories model to stories15M
-curl -L -o ${MODEL_DIR}/stories15M.pt "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt?download=true"
+For stories15M, which we use in this quick start guide, run the following:

-# Download tokenizers
-curl -L -o ${MODEL_DIR}/tokenizer.model "https://github.com/karpathy/llama2.c/raw/master/tokenizer.model"
-curl -L -o ${MODEL_DIR}/tokenizer.bin "https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin"
 ```
+huggingface-cli login
+torchat --download tinyllamas/stories15M --output-dir /tmp/stories15M
+```
+
+Some common models are recognized by torchat based on their filename: `Transformer.from_name()` performs a fuzzy match against a table of known model architectures. Alternatively, you can specify the index into that table with the option `--params-table ${INDEX}`, where the index is the dictionary key in the `transformer_configs`
+dictionary specified [here](https://github.com/pytorch/torchat/blob/main/model.py#L85). For our example with the stories15M model, this would be expressed as
+`--params-table stories15M`. (We use the model constructor `Transformer.from_table()`.)
+
+For models not in the list of known configurations, you can construct the model by initializing the `ModelArgs` dataclass that controls model construction, using a parameter JSON file specified with `params-path ${PARAMS_PATH}` that contains the appropriate model parameters. (We use the model constructor `Transformer.from_params()`.)
+
+The parameter file should be in JSON format, specifying the model parameters. You can find the `ModelArgs` data class in [`model.py`](https://github.com/pytorch/torchat/blob/main/model.py#L22).
+
+The final way to initialize a torchat model is from GGUF. You load a GGUF model with the option `--gguf-path ${MODELNAME}.gguf`. Presently, the F16, F32, Q4_0, and Q6_K formats are supported and converted into native torchat models.
+
+You may also dequantize GGUF models with the GGUF quantize tool, and then load and requantize them with torchat's native quantization options. (Please note that quantizing and dequantizing is a lossy process, and you will get the best results by starting with the original unquantized model checkpoint, not a previously quantized and then dequantized model.)
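
+For example, generation from a local GGUF checkpoint might look like the following minimal sketch (the file name and prompt are placeholders; per the *Loading GGUF models* section below, `--gguf-path` works for both generate and export):
+
+```
+python generate.py --gguf-path llama-2-7b.Q4_0.gguf --prompt "Hello, my name is"
+```
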
+| GGUF Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
+|-----|--------|-------|-----|-----|-----|-----|
+| llama-2-7b.Q4_0.gguf | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 |
 
-## Conventions
+## Chat
 
 We use several variables in this example, which may be set as a preparatory step:
 
@@ -197,18 +186,12 @@ or as Executorch model under the name `${MODEL_NAME}.pte` (for Executorch-genera
 
 We use `[ optional input ]` to indicate optional inputs, and `[ choice 1 | choice 2 | ... ]` to indicate a choice
 
-
-## A note on tokenizers
+### A note on tokenizers
 
 There are two different formats for tokenizers, and both are used in this repo.
 
 1 - for generate.py and Python bindings, we use the Google sentencepiece Python operator. This operator consumes a tokenization model in the `tokenizer.model` format.
 
 2 - for C/C++ inference, we use @Andrej Karpathy's C tokenizer function. This tokenizer consumes a tokenization model in the 'tokenizer.bin' format.
 
-If you are using conda, you can install sentencepiece using the following command:
-```
-conda install sentencepiece
-```
-
 You can convert tokenizer.model into tokenizer.bin using Andrej's
 tokenizer.py utility to convert the tokenizer.model to tokenizer.bin
 format:
 
@@ -223,16 +206,14 @@ environment:
 ./run ${MODEL_OUT}/model.{so,pte} -z ${MODEL_OUT}/tokenizer.bin
 ```
 
-### llama3 tokenizer
+### Llama 3 tokenizer
 
-Add option to load tiktoken
+Add option to load tiktoken tokenizer
 
 ```
 --tiktoken
 ```
 
-# Generate Text
-
-## Eager Execution
+## Generate
 
 Model definition in model.py, generation code in generate.py. The
 model checkpoint may have extensions `pth` (checkpoint and model definition) or `pt` (model checkpoint).
@@ -247,7 +228,17 @@ To squeeze out a little bit more performance, you can also compile the
 prefill with --compile_prefill. This will increase compilation times though.
 
-## AOT Inductor compilation and execution
+## Eval
+
+## Export
+
+Let's start by exporting and running a small model like stories15M.
+
+```
+python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --output-pte-path ${MODEL_OUT}/model.pte
+```
+
+### AOT Inductor compilation and execution
 ```
 python export.py --checkpoint-path ${MODEL_PATH} --device {cuda,cpu} --output-dso-path ${MODEL_OUT}/${MODEL_NAME}.so
 ```
@@ -270,22 +261,13 @@ execution engines while they are waiting for data.
 
 We use quantization to achieve this, as described below.
 
-## ExecuTorch mobile compilation
+### ExecuTorch mobile compilation
 
 We export the model with the export.py script. Running this script requires you first install executorch with pybindings, see [here](#setting-up-executorch-and-runner-et).
 At present, when exporting a model, the export command always uses the
 xnnpack delegate to export. (Future versions of torchat will support additional delegates such as Vulkan, CoreML, MPS, HTP in addition to Xnnpack as they are released for Executorch.)
 
-
-### Exporting the model
-Let's start by exporting and running a small model like stories15M.
- - -``` -python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --output-pte-path ${MODEL_OUT}/model.pte -``` - ### Running the model With the model exported, you can now generate text with the executorch runtime pybindings. Feel free to play around with the prompt. @@ -301,7 +283,7 @@ device supported by Executorch, most models need to be compressed to fit in the target device's memory. We use quantization to achieve this. -# llama3 support +## Llama 3 support How to obtain snapshot (to be filled in when published by Meta, we use internal snapshot] @@ -313,9 +295,9 @@ Identify and enable a runner/run.cpp with a binary tiktoken optimizer. (May alr we cannot presently run runner/run.cpp with llama3, until we have a C/C++ tokenizer im[plementation (initial tiktoken is python) -# Optimizing your model for server, desktop and mobile devices +## Optimizing your model for server, desktop and mobile devices -## Model precision (dtype precision setting)_ +## Model precision (dtype precision setting) You can generate models (for both export and generate, with eager, torch.compile, AOTI, ET, for all backends - mobile at present will primarily support fp32, with all options) specify the precision of the model with @@ -326,7 +308,6 @@ python export.py --dtype [bf16 | fp16 | fp32] ... Unlike gpt-fast which uses bfloat16 as default, Torch@ uses float32 as the default. As a consequence you will have to set to `--dtype bf16` or `--dtype fp16` on server / desktop for best performance. - ## Making your models fit and execute fast! Next, we'll show you how to optimize your model for mobile execution @@ -377,7 +358,6 @@ Now you can run your model with the same command as before: python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hello my name is" ``` - *Groupwise quantization*: We can do this in eager mode (optionally with `torch.compile`), we use the `embedding` quantizer by specifying the group size: @@ -396,8 +376,6 @@ Now you can run your model with the same command as before: python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is" ``` - - #### Linear 8 bit integer quantization (channel-wise and groupwise) The simplest way to quantize linear operators is with int8 quantization, where each value is represented by an 8-bit integer, and a floating point scale: @@ -510,7 +488,7 @@ We invite contributors to submit established quantization schemes, with accuracy # Loading GGUF models -GGUF is a nascent industry standard format and presently torchat can read the F16, F32, Q4_0, and Q6_K formats natively and convert them into native torch-chat models by using the load-gguf option: +GGUF is a nascent industry standard format and presently torchat can read the F16, F32, Q4_0, and Q6_K formats natively and convert them into native torchat models by using the load-gguf option: ``` --gguf-path # all other options as described elsewhere, works for generate and export, for all backends, but cannot be used with --quantize @@ -609,10 +587,10 @@ Detailed step by step in conjunction with ET iOS build, to run on simulator for # Supported Systems -PyTorch and the mobile Executorch backend support a broad range of devices for running PyTorch with Python (using either eager or eager + `torch.compile`) or using a Python-free environment with AOT Inductor, as well as runtimes for executing exported models. 
+PyTorch and ExecuTorch support a broad range of devices for running PyTorch with Python (using either eager or eager + `torch.compile`) or in a Python-free environment with AOT Inductor and ExecuTorch.
 
-| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
+| Hardware | OS | Eager | Eager + Compile | AOT Compile | ET Runtime |
 |-----|------|-----|-----|-----|-----|
 | x86 | Linux | βœ… | βœ… | βœ… | βœ… |
 | x86 | macOS | ? | ? | ? | ? |
@@ -718,27 +696,33 @@ The built executable is located at ./build/cmake-out/runner-et.
 
 for mobile and runner, if we can get a C/C++ tokenizer
 
+&nbsp;
 
-### Raspberry Pi 5 instructions
+---
 
-Expanded version of digant's note.
+&nbsp;
 
-# Acknowledgements
+## Acknowledgements
 
-A big thank you to
+* Georgi Gerganov and his [GGML](https://github.com/ggerganov/ggml) project for shining a spotlight on community-based enablement and inspiring so many other projects.
+* Andrej Karpathy and his [llama2.c](https://github.com/karpathy/llama2.c) project. So many great (and simple!) ideas in llama2.c that we have directly adopted (both ideas and code) from his repo. You can never go wrong by following Andrej's work.
+* Bert Maher and his [llama2.so](https://github.com/bertmaher/llama2.so), which built on Andrej's llama2.c and closed the
+loop on Llama models with AOTInductor.
+* Christian Puhrsch, Horace He, Joe Isaacson, and many more for their many contributions in Accelerating GenAI models in
+the *"Anything, Fast!"* pytorch.org blogs, and, in particular, Horace He for [GPT, Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
+directly adopted (both ideas and code) from his repo.
+* Bert Maher, Scott Wolchok, Bin Bao, Chen Yang, Huamin Li and Mu-Chu Li for great collaborations
+in building AOTInductor for CPU, including for [nanoGPT](https://github.com/karpathy/nanoGPT).
 
-* Georgi Gerganov and his [GGML](https://github.com/ggerganov/ggml) project that helped shine a spotlight
-on community-based enablement, and inspired so many other projects.
+&nbsp;
 
-* Andrej Karpathy and his [llama2.c](https://github.com/karpathy/llama2.c) project. So many great (and simple!) ideas in llama2.c that we
-have directly adopted (both ideas and code) from his repo. You can never go wrong by following Andrej's work!
+## Contributing
 
-* my colleague and friend Bert Maher and [llama2.so](https://github.com/bertmaher/llama2.so) who build on Andrej's llama2.c and closed the
-loop on llama models. The llama2.c integration with AOT Inductor comes from Bert's repo.
+We welcome any feature requests, bug reports, or pull requests from the community. See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
 
-* my colleagues and friends Christian Puhrsch, Horace He, Joe Isaacson, and many more for their many contributions in Accelerating GenAI models in
-the *"Anything, Fast!"* blog series, and in particular Horace He for [GPT, Fast!](https://github.com/pytorch-labs/gpt-fast) that we have
-directly adopted (both ideas and code) from his repo.
+&nbsp;
 
+## License
+
+Torchat is released under the [BSD 3 license](./LICENSE). However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.
 
-* my colleagues and friends Bert Maher, Scott Wolchok, Bin Bao, Chen Yang, Huamin Li and Mu-Chu Li for a great collaboration
-in building AOT Inductor for CPU, internal use cases and an experimental AOTI-compiled inference version of [nanoGPT](https://github.com/karpathy/nanoGPT).