Process readme (pytorch#665)
* executable README

* fix title of CI workflow

* markup commands in markdown

* extend the markup-markdown language

* Automatically identify cuda from nvidia-smi in install-requirements (pytorch#606)

* Automatically identify cuda from nvidia-smi in install-requirements

* Update README.md

---------

Co-authored-by: Michael Gschwind <[email protected]>

* Unbreak zero-temperature sampling (pytorch#599)

Fixes pytorch#581.

* Improve process README

* [retake] Add sentencepiece tokenizer (pytorch#626)

* Add sentencepiece tokenizer

* Add white space

* Handle white space

* Handle control ids

* More cleanup

* Lint

* Use unique_ptr

* Use a larger runner

* Debug

* Debug

* Cleanup

* Update install_utils.sh to use python3 instead of python (pytorch#636)

As titled. On some devices `python` and `python3` point to different environments, so it is good to unify them.
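
The `install_utils.sh` hunk itself is not among the files shown in this commit, so the change described above is only summarized here. A hypothetical sketch of the kind of substitution it describes (not the actual file contents; the file names below are illustrative):

```
# Hypothetical illustration only: invoke python3 explicitly so the install
# scripts do not depend on what `python` happens to resolve to on a machine.
if ! command -v python3 >/dev/null 2>&1; then
  echo "python3 is required but was not found" >&2
  exit 1
fi
python3 -m pip install -r requirements.txt
```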

* Fix quantization doc to specify dtype limitation on a8w4dq (pytorch#629)

Co-authored-by: Kimish Patel <[email protected]>

* add desktop.json (pytorch#622)

* add desktop.json

* add fast

* remove embedding

* improvements

* update readme from doc branch

* tab/spc

* fix errors in updown language

* fix errors in updown language, and [skip]: begin/end

* fix errors in updown language, and [skip]: begin/end

* a storied run

* stories run on readme instructions does not need HF token

* increase timeout

* check for hang in hf_login

* executable README improvements

* typo

* typo

---------

Co-authored-by: Ian Barber <[email protected]>
Co-authored-by: Scott Wolchok <[email protected]>
Co-authored-by: Mengwei Liu <[email protected]>
Co-authored-by: Kimish Patel <[email protected]>
Co-authored-by: Scott Roy <[email protected]>
6 people authored and malfet committed Jul 17, 2024
1 parent 824a7ab commit 86c50df
Showing 6 changed files with 24 additions and 14 deletions.
1 change: 1 addition & 0 deletions .github/workflows/run-readme2.yml
@@ -27,6 +27,7 @@ jobs:
echo "::group::Create script"
python3 scripts/process-readme.py > ./readme-commands.sh
echo "exit 1" >> ./readme-commands.sh
echo "::endgroup::"
echo "::group::Run This"
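Taken together with the `scripts/process-readme.py` change at the end of this diff, the appended `exit 1` acts as a guard: the script generated from the README exits successfully only if processing reaches the README's `[end default]:` marker, which makes the generator print `exit 0` ahead of the trailing `exit 1`. The rest of this workflow step is not shown here; presumably it ends by executing the generated script, along these lines (hypothetical command, not taken from the diff):

```
bash -x ./readme-commands.sh
```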
26 changes: 18 additions & 8 deletions README.md
@@ -80,15 +80,18 @@ HuggingFace.
python3 torchchat.py download llama3
```

*NOTE: This command may prompt you to request access to llama3 via HuggingFace, if you do not already have access. Simply follow the prompts and re-run the command when access is granted.*
*NOTE: This command may prompt you to request access to llama3 via
HuggingFace, if you do not already have access. Simply follow the
prompts and re-run the command when access is granted.*

View available models with:
```
python3 torchchat.py list
```

You can also remove downloaded models with the remove command:
`python3 torchchat.py remove llama3`

You can also remove downloaded models with the remove command: `python3 torchchat.py remove llama3`


## Running via PyTorch / Python
@@ -111,15 +114,15 @@ python3 torchchat.py generate llama3 --prompt "write me a story about a boy and

For more information run `python3 torchchat.py generate --help`

[end default]:

### Browser

[shell default]: if false; then
[skip default]: begin
```
python3 torchchat.py browser llama3
```
[shell default]: fi
[skip default]: end


*Running on http://127.0.0.1:5000* should be printed out on the
terminal. Click the link or go to
@@ -139,9 +142,15 @@ conversation.
AOT compiles models before execution for faster inference

The following example exports and executes the Llama3 8B Instruct
model. The first command performs the actual export, the second
command loads the exported model into the Python interface to enable
users to test the exported model.
```
# Compile
@@ -152,9 +161,10 @@ python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so
python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --prompt "Hello my name is"
```

NOTE: If you're machine has cuda add this flag for performance
NOTE: If your machine has cuda add this flag for performance
`--quantize config/data/cuda.json`

[end default]: end
### Running native using our C++ Runner

The end-to-end C++ [runner](runner/run.cpp) runs an `*.so` file
@@ -167,7 +177,7 @@ scripts/build_native.sh aoti

Execute
```bash
cmake-out/aoti_run exportedModels/llama3.so -z .model-artifacts/meta-llama/Meta-Llama-3-8B-Instruct/tokenizer.model -l 3 -i "Once upon a time"
cmake-out/aoti_run exportedModels/llama3.so -z ~/.torchchat/model-cache/meta-llama/Meta-Llama-3-8B-Instruct/tokenizer.model -l 3 -i "Once upon a time"
```

[end default]:
@@ -243,9 +253,9 @@ Now, follow the app's UI guidelines to pick the model and tokenizer files from t
<img src="https://pytorch.org/executorch/main/_static/img/llama_ios_app.png" width="600" alt="iOS app running a LlaMA model">
</a>


### Deploy and run on Android

MISSING. TBD.



1 change: 0 additions & 1 deletion build/utils.py
@@ -10,7 +10,6 @@
import os
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple

import torch

##########################################################################
3 changes: 1 addition & 2 deletions config/data/desktop.json
@@ -1,5 +1,4 @@
{
"executor": {"accelerator": "fast" },
"executor": {"accelerator": "fast"},
"precision": {"dtype" : "fast16"},
"linear:int4": {"groupsize" : 256}
}
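
The new `config/data/desktop.json` mirrors the way `config/data/cuda.json` is referenced in the README above; presumably it is passed through the same `--quantize` flag, e.g. (a hedged usage sketch, not a command taken from this commit):

```
python3 torchchat.py export llama3 --quantize config/data/desktop.json --output-dso-path exportedModels/llama3.so
```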
4 changes: 2 additions & 2 deletions docs/quantization.md
@@ -22,7 +22,7 @@ Due to the larger vocabulary size of llama3, we also recommend quantizing the em
|--|--|--|--|--|--|--|--|
| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||||

^ The a8w4dq quantization scheme requires inouts to be converted to fp32, due to lack of support for fp16 and bf16.
^a8w4dq quantization scheme requires model to be converted to fp32, due to lack of support for fp16 and bf16 in the kernels provided with ExecuTorch.

* These are the only valid bitwidth options.

@@ -82,7 +82,7 @@ python3 generate.py llama3 --dso-path llama3.dso --prompt "Hello my name is
```
### ExecuTorch
```
python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
python3 torchchat.py export llama3 --dtype fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
```
3 changes: 2 additions & 1 deletion scripts/process-readme.py
@@ -15,6 +15,7 @@ def print_between_triple_backticks(filename, predicate):
elif line.startswith(command):
print(line[len(command) :])
elif line.startswith(end):
print("exit 0")
return
elif line.startswith(skip):
keyword = line[len(skip):-1].strip()
@@ -34,6 +35,6 @@ def print_between_triple_backticks(filename, predicate):
if len(sys.argv) > 1:
predicate = sys.argv[1]
else:
predicate = "default"
predicate="default"

print_between_triple_backticks("README.md", predicate)
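
Only fragments of `scripts/process-readme.py` appear in this diff. The sketch below is a hedged reconstruction of how the pieces might fit together; the marker formats (`[shell <predicate>]:`, `[end <predicate>]:`, `[skip <predicate>]: begin/end`) and the surrounding control flow are assumptions inferred from the README markup and the `command`, `end`, and `skip` branches visible above, not the actual file:

```
# Hedged sketch; only the elif branches shown in the diff are taken from the
# real script, everything else (marker strings, loop structure) is assumed.
import sys

def print_between_triple_backticks(filename, predicate):
    fence = "`" * 3                     # a line of three backticks opens/closes a code block
    command = f"[shell {predicate}]: "  # assumed: hidden shell-directive marker
    end = f"[end {predicate}]:"         # assumed: stop-processing marker
    skip = f"[skip {predicate}]: "      # assumed: begin/end skip-block marker
    in_code_block = False
    skipping = False
    with open(filename) as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line.startswith(fence):
                in_code_block = not in_code_block  # toggle on every fence
            elif skipping:
                if line.startswith(skip) and line[len(skip):].strip() == "end":
                    skipping = False               # leave the skipped region
            elif in_code_block:
                print(line)                        # fenced commands are emitted verbatim
            elif line.startswith(command):
                print(line[len(command):])         # hidden shell directives are emitted too
            elif line.startswith(end):
                print("exit 0")                    # succeed before the appended "exit 1"
                return
            elif line.startswith(skip):
                if line[len(skip):].strip() == "begin":
                    skipping = True                # suppress output until the matching end

if __name__ == "__main__":
    predicate = sys.argv[1] if len(sys.argv) > 1 else "default"
    print_between_triple_backticks("README.md", predicate)
```

With the workflow change at the top of this diff, `python3 scripts/process-readme.py > ./readme-commands.sh` turns the README into a shell script that CI can execute directly.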
