Refactor and Fix the Readme #563
Changes from 15 commits
@@ -8,6 +8,7 @@ __pycache__/
.model-artifacts/
.venv
.torchchat

# Build directories
build/android/*
@@ -1,6 +1,6 @@
# Chat with LLMs Everywhere
torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.
## What can you do with torchchat?

@@ -19,7 +19,7 @@ torchchat is a small codebase showcasing the ability to run large language model
- [Deploy and run on iOS](#deploy-and-run-on-ios)
- [Deploy and run on Android](#deploy-and-run-on-android)
- [Evaluate a model](#eval)
- [Fine-tuned models from torchtune](#fine-tuned-models-from-torchtune)
- [Fine-tuned models from torchtune](docs/torchtune.md)
- [Supported Models](#models)
- [Troubleshooting](#troubleshooting)
@@ -37,13 +37,7 @@ torchchat is a small codebase showcasing the ability to run large language model
- Multiple quantization schemes
- Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)

### Disclaimer
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.

## Installation

The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

```bash
@@ -89,7 +83,12 @@ View available models with:
python3 torchchat.py list
```

You can also remove downloaded models with the remove command: `python3 torchchat.py remove llama3`

You can also remove downloaded models with the remove command:
```
python3 torchchat.py remove llama3
```

## Running via PyTorch / Python
[Follow the installation steps if you haven't](#installation)
@@ -104,15 +103,15 @@ For more information run `python3 torchchat.py chat --help`
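As a concrete starting point for the chat mode referenced in the hunk header above, a minimal invocation might look like the sketch below; the exact arguments are not shown in this diff, and the `llama3` alias is assumed by analogy with the generate and browser examples that follow.

```bash
# Sketch only: start an interactive chat session with the llama3 alias
python3 torchchat.py chat llama3
```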
### Generate
```bash
python3 torchchat.py generate llama3
python3 torchchat.py generate llama3 --prompt "write me a story about a boy and his bear"
```

For more information run `python3 torchchat.py generate --help`

### Browser

```
python3 torchchat.py browser llama3 --temperature 0 --num-samples 10
python3 torchchat.py browser llama3
```

*Running on http://127.0.0.1:5000* should be printed out on the terminal. Click the link or go to [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser to start interacting with it.
@@ -126,16 +125,17 @@ Enter some text in the input box, then hit the enter key or click the “SEND”
### AOTI (AOT Inductor)
AOT compiles models before execution for faster inference

The following example exports and executes the Llama3 8B Instruct model
The following example exports and executes the Llama3 8B Instruct model. (The first command performs the actual export, the second command loads the exported model into the Python interface to enable users to test the exported model.)

Review comment: No need to add parenthesis on that second sentence?

```
# Compile
python3 torchchat.py export llama3 --output-dso-path llama3.so
python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so

# Execute the exported model using Python
python3 torchchat.py generate llama3 --quantize config/data/cuda.json --dso-path llama3.so --prompt "Hello my name is"
```

NOTE: We use `--quantize config/data/cuda.json` to quantize the llama3 model to reduce model size and improve performance for on-device use cases.
python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --prompt "Hello my name is"
```
NOTE: If you're machine has cuda add this flag for performance `--quantize config/data/cuda.json`

Review comments: Nit: has cuda -> has CUDA. Also: "you're machine" => "your machine"
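The NOTE above does not say which command the flag attaches to; in the line this hunk removes, it was passed to the generate call, so a sketch following that reading (assuming a CUDA-capable machine) would be:

```bash
# Sketch only: re-adding the CUDA quantization flag to the generate call,
# as in the removed line of this hunk
python3 torchchat.py generate llama3 --quantize config/data/cuda.json --dso-path exportedModels/llama3.so --prompt "Hello my name is"
```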
### Running native using our C++ Runner

@@ -148,7 +148,7 @@ scripts/build_native.sh aoti

Execute
```bash
cmake-out/aoti_run model.so -z tokenizer.model -l 3 -i "Once upon a time"
cmake-out/aoti_run exportedModels/llama3.so -z .model-artifacts/meta-llama/Meta-Llama-3-8B-Instruct/tokenizer.model -l 3 -i "Once upon a time"
```

## Mobile Execution
@@ -159,13 +159,15 @@ Before running any commands in torchchat that require ExecuTorch, you must first
To install ExecuTorch, run the following commands *from the torchchat root directory*.
This will download the ExecuTorch repo to ./et-build/src and install various ExecuTorch libraries to ./et-build/install.

```
export TORCHCHAT_ROOT=$PWD
./scripts/install_et.sh
```

### Export for mobile
The following example uses the Llama3 8B Instruct model.

```
# Export
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
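# Sketch only (not shown in this hunk): testing the exported program from the
# Python interface; the --pte-path flag is assumed here, mirroring the
# --dso-path flag used in the AOTI example above
python3 torchchat.py generate llama3 --pte-path llama3.pte --prompt "Hello my name is"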
@@ -201,39 +203,11 @@ Now, follow the app's UI guidelines to pick the model and tokenizer files from t
<img src="https://pytorch.org/executorch/main/_static/img/llama_ios_app.png" width="600" alt="iOS app running a LlaMA model">
</a>

### Deploy and run on Android

## Fine-tuned models from torchtune

torchchat supports running inference with models fine-tuned using [torchtune](https://github.com/pytorch/torchtune). To do so, we first need to convert the checkpoints into a format supported by torchchat.

Below is a simple workflow to run inference on a fine-tuned Llama3 model. For more details on how to fine-tune Llama3, see the instructions [here](https://github.com/pytorch/torchtune?tab=readme-ov-file#llama3)

```bash
# install torchtune
pip install torchtune

# download the llama3 model
tune download meta-llama/Meta-Llama-3-8B \
--output-dir ./Meta-Llama-3-8B \
--hf-token <ACCESS TOKEN>

# Run LoRA fine-tuning on a single device. This assumes the config points to <checkpoint_dir> above
tune run lora_finetune_single_device --config llama3/8B_lora_single_device

# convert the fine-tuned checkpoint to a format compatible with torchchat
python3 build/convert_torchtune_checkpoint.py \
--checkpoint-dir ./Meta-Llama-3-8B \
--checkpoint-files meta_model_0.pt \
--model-name llama3_8B \
--checkpoint-format meta

# run inference on a single GPU
python3 torchchat.py generate \
--checkpoint-path ./Meta-Llama-3-8B/model.pth \
--device cuda
```

### Eval
Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the tasks and limit args.
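The paragraph above names the tasks and limit arguments, but the hunk does not show an example invocation; a hypothetical one, assuming the flags are spelled `--tasks` and `--limit`, might look like:

```bash
# Sketch only: evaluate the llama3 alias on wikitext, capping the number of examples
python3 torchchat.py eval llama3 --tasks wikitext --limit 5
```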
@@ -282,12 +256,17 @@ While we describe how to use torchchat using the popular llama3 model, you can p

## Troubleshooting

**CERTIFICATE_VERIFY_FAILED**:

**CERTIFICATE_VERIFY_FAILED**
Run `pip install --upgrade certifi`.

**Access to model is restricted and you are not in the authorized list.**
**Access to model is restricted and you are not in the authorized list**
Some models require an additional step to access. Follow the link provided in the error to get access.
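Not part of this diff, but a common companion step for gated Llama checkpoints: after access is granted on the model page, you typically also need to be authenticated with Hugging Face locally, for example:

```bash
# Assumes the Hugging Face hub CLI is installed; paste an access token when prompted
huggingface-cli login
```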
### Disclaimer
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.

## Acknowledgements
Thank you to the [community](docs/ACKNOWLEDGEMENTS.md) for all the awesome libraries and tools
you've built around local LLM inference.
@@ -0,0 +1,30 @@
# Fine-tuned models from torchtune

torchchat supports running inference with models fine-tuned using [torchtune](https://github.com/pytorch/torchtune). To do so, we first need to convert the checkpoints into a format supported by torchchat.

Below is a simple workflow to run inference on a fine-tuned Llama3 model. For more details on how to fine-tune Llama3, see the instructions [here](https://github.com/pytorch/torchtune?tab=readme-ov-file#llama3)

```bash
# install torchtune
pip install torchtune

# download the llama3 model
tune download meta-llama/Meta-Llama-3-8B \
--output-dir ./Meta-Llama-3-8B \
--hf-token <ACCESS TOKEN>

# Run LoRA fine-tuning on a single device. This assumes the config points to <checkpoint_dir> above
tune run lora_finetune_single_device --config llama3/8B_lora_single_device

# convert the fine-tuned checkpoint to a format compatible with torchchat
python3 build/convert_torchtune_checkpoint.py \
--checkpoint-dir ./Meta-Llama-3-8B \
--checkpoint-files meta_model_0.pt \
--model-name llama3_8B \
--checkpoint-format meta

# run inference on a single GPU
python3 torchchat.py generate \
--checkpoint-path ./Meta-Llama-3-8B/model.pth \
--device cuda
```
Review comment: Nit on this one is that I'd actually like them not to run it, so actually better to have it inline?

Review comment: +1. Yes, I commented the same and edited the command in the README.md, together with many others. And no, I don't deal with branches. Just submit your stuff rather than having it in a side branch and just letting it drop from heaven, like a piano in a cartoon.