
Refactor and Fix the Readme #563

Merged — 16 commits, May 6, 2024
.gitignore — 1 change: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ __pycache__/

.model-artifacts/
.venv
.torchchat

# Build directories
build/android/*
README.md — 75 changes: 27 additions & 48 deletions
@@ -1,6 +1,6 @@
# Chat with LLMs Everywhere
torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.

torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.


## What can you do with torchchat?
@@ -19,7 +19,7 @@ torchchat is a small codebase showcasing the ability to run large language model
- [Deploy and run on iOS](#deploy-and-run-on-ios)
- [Deploy and run on Android](#deploy-and-run-on-android)
- [Evaluate a model](#eval)
- [Fine-tuned models from torchtune](#fine-tuned-models-from-torchtune)
- [Fine-tuned models from torchtune](docs/torchtune.md)
- [Supported Models](#models)
- [Troubleshooting](#troubleshooting)

@@ -37,13 +37,7 @@ torchchat is a small codebase showcasing the ability to run large language model
- Multiple quantization schemes
- Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)

### Disclaimer
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.


## Installation


The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

```bash
@@ -89,7 +83,12 @@ View available models with:
python3 torchchat.py list
```

You can also remove downloaded models with the remove command: `python3 torchchat.py remove llama3`

Nit on this one is that I'd actually like them not to run it, so actually better to have inline?

@mikekgfb, May 4, 2024:


+1

Yes, I commented the same and edited the command in the README.md, together with many others. And no, I don't deal with branches. Just submit your stuff rather than having it in a side branch and letting it drop from heaven, like a piano in a cartoon.


You can also remove downloaded models with the remove command:
```
python3 torchchat.py remove llama3
```


## Running via PyTorch / Python
[Follow the installation steps if you haven't](#installation)
@@ -104,15 +103,15 @@ For more information run `python3 torchchat.py chat --help`

### Generate
```bash
python3 torchchat.py generate llama3
python3 torchchat.py generate llama3 --prompt "write me a story about a boy and his bear"
```

For more information run `python3 torchchat.py generate --help`

### Browser

```
python3 torchchat.py browser llama3 --temperature 0 --num-samples 10
python3 torchchat.py browser llama3
```

*Running on http://127.0.0.1:5000* should be printed out on the terminal. Click the link or go to [http://127.0.0.1:5000](http://127.0.0.1:5000) on your browser to start interacting with it.
@@ -126,16 +125,17 @@ Enter some text in the input box, then hit the enter key or click the “SEND”
### AOTI (AOT Inductor)
AOT compiles models before execution for faster inference

The following example exports and executes the Llama3 8B Instruct model
The following example exports and executes the Llama3 8B Instruct model. (The first command performs the actual export, the second command loads the exported model into the Python interface to enable users to test the exported model.)

No need to add parenthesis on that second sentence?


```
# Compile
python3 torchchat.py export llama3 --output-dso-path llama3.so
python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so

# Execute the exported model using Python
python3 torchchat.py generate llama3 --quantize config/data/cuda.json --dso-path llama3.so --prompt "Hello my name is"
```

NOTE: We use `--quantize config/data/cuda.json` to quantize the llama3 model to reduce model size and improve performance for on-device use cases.
python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --prompt "Hello my name is"
```
NOTE: If you're machine has cuda add this flag for performance `--quantize config/data/cuda.json`

Nit: has cuda -> has CUDA


Also: "you're machine" => "your machine"


### Running native using our C++ Runner

@@ -148,7 +148,7 @@ scripts/build_native.sh aoti

Execute
```bash
cmake-out/aoti_run model.so -z tokenizer.model -l 3 -i "Once upon a time"
cmake-out/aoti_run exportedModels/llama3.so -z .model-artifacts/meta-llama/Meta-Llama-3-8B-Instruct/tokenizer.model -l 3 -i "Once upon a time"
```

## Mobile Execution
@@ -159,13 +159,15 @@ Before running any commands in torchchat that require ExecuTorch, you must first

To install ExecuTorch, run the following commands *from the torchchat root directory*.
This will download the ExecuTorch repo to ./et-build/src and install various ExecuTorch libraries to ./et-build/install.

```
export TORCHCHAT_ROOT=$PWD
./scripts/install_et.sh
```

### Export for mobile
The following example uses the Llama3 8B Instruct model.

```
# Export
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
@@ -201,39 +203,11 @@ Now, follow the app's UI guidelines to pick the model and tokenizer files from t
<img src="https://pytorch.org/executorch/main/_static/img/llama_ios_app.png" width="600" alt="iOS app running a LlaMA model">
</a>

### Deploy and run on Android

### Deploy and run on Android

## Fine-tuned models from torchtune

torchchat supports running inference with models fine-tuned using [torchtune](https://github.com/pytorch/torchtune). To do so, we first need to convert the checkpoints into a format supported by torchchat.

Below is a simple workflow to run inference on a fine-tuned Llama3 model. For more details on how to fine-tune Llama3, see the instructions [here](https://github.com/pytorch/torchtune?tab=readme-ov-file#llama3)

```bash
# install torchtune
pip install torchtune

# download the llama3 model
tune download meta-llama/Meta-Llama-3-8B \
--output-dir ./Meta-Llama-3-8B \
--hf-token <ACCESS TOKEN>

# Run LoRA fine-tuning on a single device. This assumes the config points to <checkpoint_dir> above
tune run lora_finetune_single_device --config llama3/8B_lora_single_device

# convert the fine-tuned checkpoint to a format compatible with torchchat
python3 build/convert_torchtune_checkpoint.py \
--checkpoint-dir ./Meta-Llama-3-8B \
--checkpoint-files meta_model_0.pt \
--model-name llama3_8B \
--checkpoint-format meta

# run inference on a single GPU
python3 torchchat.py generate \
--checkpoint-path ./Meta-Llama-3-8B/model.pth \
--device cuda
```

### Eval
Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the tasks and limit args.
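A minimal invocation might look like the following sketch. This is hypothetical: the `eval` subcommand name and the exact `--tasks`/`--limit` flag spellings are assumed from the description above and may differ from the actual CLI; check `python3 torchchat.py --help` for the real options.

```bash
# Hypothetical sketch: evaluate llama3 on wikitext, capping the number of evaluated samples
python3 torchchat.py eval llama3 --tasks wikitext --limit 10
```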
@@ -282,12 +256,17 @@ While we describe how to use torchchat using the popular llama3 model, you can p

## Troubleshooting

**CERTIFICATE_VERIFY_FAILED**:

**CERTIFICATE_VERIFY_FAILED**
Run `pip install --upgrade certifi`.

**Access to model is restricted and you are not in the authorized list.**
**Access to model is restricted and you are not in the authorized list**
Some models require an additional step to access. Follow the link provided in the error to get access.
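If the model is gated on Hugging Face (as the Llama models are), a common follow-up once your access request is approved is to authenticate locally so downloads can use your credentials — the torchtune workflow in docs/torchtune.md similarly passes an `--hf-token`. For example:

```bash
# Log in with a Hugging Face token that has read access to the gated model
huggingface-cli login
```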

### Disclaimer
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.


## Acknowledgements
Thank you to the [community](docs/ACKNOWLEDGEMENTS.md) for all the awesome libraries and tools
you've built around local LLM inference.
docs/torchtune.md — 30 changes: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# Fine-tuned models from torchtune

torchchat supports running inference with models fine-tuned using [torchtune](https://github.com/pytorch/torchtune). To do so, we first need to convert the checkpoints into a format supported by torchchat.

Below is a simple workflow to run inference on a fine-tuned Llama3 model. For more details on how to fine-tune Llama3, see the instructions [here](https://github.com/pytorch/torchtune?tab=readme-ov-file#llama3)

```bash
# install torchtune
pip install torchtune

# download the llama3 model
tune download meta-llama/Meta-Llama-3-8B \
--output-dir ./Meta-Llama-3-8B \
--hf-token <ACCESS TOKEN>

# Run LoRA fine-tuning on a single device. This assumes the config points to <checkpoint_dir> above
tune run lora_finetune_single_device --config llama3/8B_lora_single_device

# convert the fine-tuned checkpoint to a format compatible with torchchat
python3 build/convert_torchtune_checkpoint.py \
--checkpoint-dir ./Meta-Llama-3-8B \
--checkpoint-files meta_model_0.pt \
--model-name llama3_8B \
--checkpoint-format meta

# run inference on a single GPU
python3 torchchat.py generate \
--checkpoint-path ./Meta-Llama-3-8B/model.pth \
--device cuda
```