
Refactor and Fix the Readme #563

Merged — 16 commits, May 6, 2024
.gitignore — 1 change: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ __pycache__/

.model-artifacts/
.venv
.torchchat

# Build directories
build/android/*
README.md — 75 changes: 27 additions & 48 deletions
@@ -1,6 +1,6 @@
# Chat with LLMs Everywhere
torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.

torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.


## What can you do with torchchat?
@@ -19,7 +19,7 @@ torchchat is a small codebase showcasing the ability to run large language model
- [Deploy and run on iOS](#deploy-and-run-on-ios)
- [Deploy and run on Android](#deploy-and-run-on-android)
- [Evaluate a model](#eval)
- [Fine-tuned models from torchtune](#fine-tuned-models-from-torchtune)
- [Fine-tuned models from torchtune](docs/torchtune.md)
- [Supported Models](#models)
- [Troubleshooting](#troubleshooting)

@@ -37,13 +37,7 @@ torchchat is a small codebase showcasing the ability to run large language model
- Multiple quantization schemes
- Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)

### Disclaimer
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.


## Installation


The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

```bash
@@ -89,7 +83,12 @@ View available models with:
python3 torchchat.py list
```

You can also remove downloaded models with the remove command: `python3 torchchat.py remove llama3`

Nit on this one is that I'd actually like them not to run it, so actually better to have inline?

@mikekgfb, May 4, 2024:


+1

Yes, I commented the same and edited the command in the README.md, together with many others. And no, I don't deal with branches. Just submit your stuff rather than having it in a side branch and letting it drop from heaven, like a piano in a cartoon.


You can also remove downloaded models with the remove command:
```
python3 torchchat.py remove llama3
```


## Running via PyTorch / Python
[Follow the installation steps if you haven't](#installation)
@@ -104,15 +103,15 @@ For more information run `python3 torchchat.py chat --help`

### Generate
```bash
python3 torchchat.py generate llama3
python3 torchchat.py generate llama3 --prompt "write me a story about a boy and his bear"
```

For more information run `python3 torchchat.py generate --help`

### Browser

```
python3 torchchat.py browser llama3 --temperature 0 --num-samples 10
python3 torchchat.py browser llama3
```

*Running on http://127.0.0.1:5000* should be printed out on the terminal. Click the link or go to [http://127.0.0.1:5000](http://127.0.0.1:5000) on your browser to start interacting with it.
@@ -126,16 +125,17 @@ Enter some text in the input box, then hit the enter key or click the “SEND”
### AOTI (AOT Inductor)
AOT compiles models before execution for faster inference

The following example exports and executes the Llama3 8B Instruct model
The following example exports and executes the Llama3 8B Instruct model. (The first command performs the actual export, the second command loads the exported model into the Python interface to enable users to test the exported model.)

No need to add parenthesis on that second sentence?


```
# Compile
python3 torchchat.py export llama3 --output-dso-path llama3.so
python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so

# Execute the exported model using Python
python3 torchchat.py generate llama3 --quantize config/data/cuda.json --dso-path llama3.so --prompt "Hello my name is"
```

NOTE: We use `--quantize config/data/cuda.json` to quantize the llama3 model to reduce model size and improve performance for on-device use cases.
python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --prompt "Hello my name is"
```
NOTE: If you're machine has cuda add this flag for performance `--quantize config/data/cuda.json`

Nit: has cuda -> has CUDA


Also: "you're machine" => "your machine"


### Running native using our C++ Runner

@@ -148,7 +148,7 @@ scripts/build_native.sh aoti

Execute
```bash
cmake-out/aoti_run model.so -z tokenizer.model -l 3 -i "Once upon a time"
cmake-out/aoti_run exportedModels/llama3.so -z .model-artifacts/meta-llama/Meta-Llama-3-8B-Instruct/tokenizer.model -l 3 -i "Once upon a time"
```

## Mobile Execution
@@ -159,13 +159,15 @@ Before running any commands in torchchat that require ExecuTorch, you must first

To install ExecuTorch, run the following commands *from the torchchat root directory*.
This will download the ExecuTorch repo to ./et-build/src and install various ExecuTorch libraries to ./et-build/install.

```
export TORCHCHAT_ROOT=$PWD
./scripts/install_et.sh
```

### Export for mobile
The following example uses the Llama3 8B Instruct model.

```
# Export
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
@@ -201,39 +203,11 @@ Now, follow the app's UI guidelines to pick the model and tokenizer files from t
<img src="https://pytorch.org/executorch/main/_static/img/llama_ios_app.png" width="600" alt="iOS app running a LlaMA model">
</a>

### Deploy and run on Android

### Deploy and run on Android

## Fine-tuned models from torchtune

torchchat supports running inference with models fine-tuned using [torchtune](https://github.com/pytorch/torchtune). To do so, we first need to convert the checkpoints into a format supported by torchchat.

Below is a simple workflow to run inference on a fine-tuned Llama3 model. For more details on how to fine-tune Llama3, see the instructions [here](https://github.com/pytorch/torchtune?tab=readme-ov-file#llama3)

```bash
# install torchtune
pip install torchtune

# download the llama3 model
tune download meta-llama/Meta-Llama-3-8B \
--output-dir ./Meta-Llama-3-8B \
--hf-token <ACCESS TOKEN>

# Run LoRA fine-tuning on a single device. This assumes the config points to <checkpoint_dir> above
tune run lora_finetune_single_device --config llama3/8B_lora_single_device

# convert the fine-tuned checkpoint to a format compatible with torchchat
python3 build/convert_torchtune_checkpoint.py \
--checkpoint-dir ./Meta-Llama-3-8B \
--checkpoint-files meta_model_0.pt \
--model-name llama3_8B \
--checkpoint-format meta

# run inference on a single GPU
python3 torchchat.py generate \
--checkpoint-path ./Meta-Llama-3-8B/model.pth \
--device cuda
```

### Eval
Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the tasks and limit args.
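A minimal invocation might look like the following sketch. This is hypothetical: the `eval` subcommand name and the exact `--tasks`/`--limit` flag spellings are assumed from the description above and may differ from the actual CLI; check `python3 torchchat.py --help` for the real options.

```bash
# Hypothetical sketch: evaluate llama3 on wikitext, capping the number of evaluated samples
python3 torchchat.py eval llama3 --tasks wikitext --limit 10
```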
@@ -282,12 +256,17 @@ While we describe how to use torchchat using the popular llama3 model, you can p

## Troubleshooting

**CERTIFICATE_VERIFY_FAILED**:

**CERTIFICATE_VERIFY_FAILED**
Run `pip install --upgrade certifi`.

**Access to model is restricted and you are not in the authorized list.**
**Access to model is restricted and you are not in the authorized list**
Some models require an additional step to access. Follow the link provided in the error to get access.
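If the model is gated on Hugging Face (as the Llama models are), a common follow-up once your access request is approved is to authenticate locally so downloads can use your credentials — the torchtune workflow in docs/torchtune.md similarly passes an `--hf-token`. For example:

```bash
# Log in with a Hugging Face token that has read access to the gated model
huggingface-cli login
```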

### Disclaimer
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.


## Acknowledgements
Thank you to the [community](docs/ACKNOWLEDGEMENTS.md) for all the awesome libraries and tools
you've built around local LLM inference.
docs/torchtune.md — 30 changes: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
# Fine-tuned models from torchtune

torchchat supports running inference with models fine-tuned using [torchtune](https://github.com/pytorch/torchtune). To do so, we first need to convert the checkpoints into a format supported by torchchat.

Below is a simple workflow to run inference on a fine-tuned Llama3 model. For more details on how to fine-tune Llama3, see the instructions [here](https://github.com/pytorch/torchtune?tab=readme-ov-file#llama3)

```bash
# install torchtune
pip install torchtune

# download the llama3 model
tune download meta-llama/Meta-Llama-3-8B \
--output-dir ./Meta-Llama-3-8B \
--hf-token <ACCESS TOKEN>

# Run LoRA fine-tuning on a single device. This assumes the config points to <checkpoint_dir> above
tune run lora_finetune_single_device --config llama3/8B_lora_single_device

# convert the fine-tuned checkpoint to a format compatible with torchchat
python3 build/convert_torchtune_checkpoint.py \
--checkpoint-dir ./Meta-Llama-3-8B \
--checkpoint-files meta_model_0.pt \
--model-name llama3_8B \
--checkpoint-format meta

# run inference on a single GPU
python3 torchchat.py generate \
--checkpoint-path ./Meta-Llama-3-8B/model.pth \
--device cuda
```