Commit bdc526b

Qualcomm AI Engine Direct - change the llama tutorial to static llama version (#14887)
### Summary

change the llama tutorial to static llama version

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @cbilgin
1 parent a509431 commit bdc526b

File tree: 2 files changed (+60, -105 lines)


docs/source/backends-qualcomm.md

Lines changed: 1 addition & 1 deletion
@@ -397,4 +397,4 @@ print(f"Model successfully exported to {model_name}")
## FAQ

If you encounter any issues while reproducing the tutorial, please file a github
-issue on ExecuTorch repo and tag use `#qcom_aisw` tag
[issue](https://github.com/pytorch/executorch/issues) on the ExecuTorch repo and tag it with the `#qcom_aisw` tag
Lines changed: 59 additions & 104 deletions
@@ -1,6 +1,7 @@
-# Run Llama 3 8B on Android (with Qualcomm AI Engine Direct Backend)
# Run Llama 3 3B Instruct on Android (with Qualcomm AI Engine Direct Backend)

-This tutorial demonstrates how to export Llama 3 8B Instruct for Qualcomm AI Engine Direct Backend and running the model on a Qualcomm device.
This tutorial demonstrates how to export and run the Llama 3 3B Instruct model on a Qualcomm device using the Qualcomm AI Engine Direct Backend via ExecuTorch.
We use a static Llama [implementation](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/model/static_llama.py) to optimize performance and memory usage during on-device inference.

## Prerequisites

@@ -13,10 +14,8 @@ This tutorial demonstrates how to export Llama 3 8B Instruct for Qualcomm AI Eng

## Instructions

-### Step 1: Prepare the checkpoint of the model and optimized matrix from [Spin Quant](https://github.com/facebookresearch/SpinQuant)
-
-1. For Llama 3 tokenizer and checkpoint, please refer to https://github.com/meta-llama/llama-models/blob/main/README.md for further instructions on how to download `tokenizer.model`, `consolidated.00.pth` and `params.json`.
-2. To get the optimized matrix, please refer to [SpinQuant on GitHub](https://github.com/facebookresearch/SpinQuant). You can download the optimized rotation matrices in the Quantized Models section. Please choose **LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0**.
### Step 1: Prepare the checkpoint and tokenizer of the model
1. For the Llama 3 tokenizer and checkpoint, please follow these [instructions](https://www.llama.com/models/llama-3) to download `tokenizer.model`, `consolidated.00.pth` and `params.json`.
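As a quick sanity check, you can confirm that all three files are in place before exporting. This is only a sketch: the directory name below is a placeholder, so point it at wherever you saved the download.

```bash
# Placeholder path; replace with the directory that holds your downloaded files.
CHECKPOINT_DIR=./llama3_2_3b_instruct
ls "${CHECKPOINT_DIR}"
# Expected: consolidated.00.pth  params.json  tokenizer.model
```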

### Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend
Deploying large language models like Llama 3 on-device presents the following challenges:

@@ -25,123 +24,79 @@ Deploying large language models like Llama 3 on-device presents the following ch
2. High model loading and inference time.
3. Difficulty in quantization.
-To address these challenges, we have implemented the following solutions:
-1. Using `quantization.pt2e_quantize = "qnn_16a4w'` to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
-2. Using `backed.qnn.num_sharding = 8` to shard the model into sub-parts.
-3. Performing graph transformations to convert or decompose operations into more accelerator-friendly operations.
-4. Using `backend.qnn.optimized_rotation_path = "<path_to_optimized_matrix>"` to apply R1 and R2 of [Spin Quant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
-5. Using `quantization.calibration_data = "<|start_header_id|>system<|end_header_id|..."` to ensure that during quantization, the calibration includes special tokens in the prompt template. For more details on the prompt template, refer to [the model card](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).

To address these, we apply the following optimizations:

1. Quantization: Use `QuantDtype.use_16a4w_block` for post-training quantization to reduce model size and memory usage.

2. Mixed Precision Quantization: Compress KV cache tensors to 8-bit and apply `QuantDtype.use_16a8w` to the LM head.

3. Model Sharding: Set `num_sharding = 4` to shard the model into sub-parts. This helps reduce memory pressure and improve performance during on-device inference. The number of shards may differ depending on the model size.

4. Graph Transformations: Convert operations into accelerator-friendly formats for better runtime performance.

You can find the full optimization configuration in this [file](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/__init__.py), as shown below:
``` python
@register_llm_model("llama3_2-3b_instruct")
@dataclass(init=False, frozen=True)
class Llama3_2_3B_Instruct(LLMModelConfig):
    repo_id = None
    params_path = None
    convert_weights = None
    transform_weight = True
    # The model enabled here should be the instruct variant; however, Llama's
    # tokenizer does not provide a utility to apply the chat template.
    instruct_model = False

    num_sharding = 4
    # quant config
    ptq = QuantDtype.use_16a4w_block
    group_size = 32  # Group size used in block quantization of weights. Only used when ptq = 16a4w_block.
    masked_softmax = False

    # SeqMSE quantization optimizes the parameter encodings of each layer individually to minimize
    # the difference between the layer's original and quantized outputs.
    # (Implementation details: ./backends/qualcomm/_passes/seq_mse.py)
    # Here seq_mse_candidates = 0, which means SeqMSE quantization is not applied.
    seq_mse_candidates = 0
    r1 = False
    r2 = False
    r3 = False
    custom_annotation = (
        annotate_kv_8bit,
        annotate_output_16a8w,
    )
```
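For intuition about what `QuantDtype.use_16a4w_block` with `group_size = 32` means, here is a minimal, self-contained sketch of per-block symmetric 4-bit weight quantization. It illustrates the general technique only and is not the backend's actual implementation; the function name and the error metric are ours.

```python
import torch

def block_quantize_4bit(weight: torch.Tensor, group_size: int = 32):
    """Toy per-block symmetric 4-bit quantization: each block of `group_size`
    weights gets its own scale, so a single outlier only affects its own block."""
    w = weight.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0   # per-block scale
    codes = torch.clamp(torch.round(w / scale), -8, 7)                # 4-bit integer codes
    dequant = (codes * scale).reshape(weight.shape)                   # reconstruction
    return codes.to(torch.int8), scale, dequant

w = torch.randn(4096, 4096)
codes, scales, w_hat = block_quantize_4bit(w)
print(f"max abs reconstruction error: {(w - w_hat).abs().max().item():.4f}")
```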
To export with the Qualcomm AI Engine Direct Backend, ensure the following:

-1. The host machine has more than 100GB of memory (RAM + swap space).
1. The host machine has more than 64GB of memory (RAM + swap space); a quick check is sketched below.
2. The entire process takes a few hours.
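For the memory requirement, a quick way to check RAM plus swap on a Linux host (a minimal sketch; the `free` utility is assumed to be available):

```bash
free -h   # Mem + Swap totals together should exceed roughly 64GB
```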
```bash
-# path/to/config.yaml
-base:
-  model_class: llama3
-  checkpoint: path/to/consolidated.00.pth
-  params: path/to/params.json
-  tokenizer_path: path/to/tokenizer.model
-  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
-model:
-  use_kv_cache: True
-  enable_dynamic_shape: False
-quantization:
-  pt2e_quantize: qnn_16a4w
-  # Please note that calibration_data must include the prompt template for special tokens.
-  calibration_data: "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
-backend:
-  qnn:
-    enabled: True
-    num_sharding: 8
-
-# export_llm
-python -m extension.llm.export.export_llm \
-  --config path/to/config.yaml
# export llama
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode kv --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --compile_only
```
Note: see the end-to-end [instructions](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md) for the complete workflow.
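For completeness, here is one way the shell variables used above could be set. These are assumptions for illustration: the SoC model value is a placeholder that depends on your device, and `adb get-serialno` is just one way to obtain the serial.

```bash
export SERIAL_NUM=$(adb get-serialno)   # serial of the attached device (see `adb devices`)
export SOC_MODEL=SM8650                 # placeholder; use the SoC model of your phone
```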
### Step 3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs
-1. Build executorch with Qualcomm AI Engine Direct Backend for android
-```bash
-cmake \
-  -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake" \
-  -DANDROID_ABI=arm64-v8a \
-  -DCMAKE_INSTALL_PREFIX=cmake-android-out \
-  -DCMAKE_BUILD_TYPE=Release \
-  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
-  -DEXECUTORCH_BUILD_QNN=ON \
-  -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
-  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
-  -Bcmake-android-out .
-
-cmake --build cmake-android-out -j16 --target install --config Release
-```
-2. Build llama runner for android
-```bash
-cmake \
-  -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}"/build/cmake/android.toolchain.cmake \
-  -DANDROID_ABI=arm64-v8a \
-  -DCMAKE_INSTALL_PREFIX=cmake-android-out \
-  -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
-  -DEXECUTORCH_BUILD_QNN=ON \
-  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-  -DEXECUTORCH_BUILD_KERNELS_LLM=ON \
-  -Bcmake-android-out/examples/models/llama examples/models/llama
-
-cmake --build cmake-android-out/examples/models/llama -j16 --config Release
-```
-3. Run on Android via adb shell
-*Pre-requisite*: Make sure you enable USB debugging via developer options on your phone
-
**3.1 Connect your android phone**

-**3.2 We need to push required QNN libraries to the device.**
-```bash
-# make sure you have write-permission on below path.
-DEVICE_DIR=/data/local/tmp/llama
-adb shell mkdir -p ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
-adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
-```
**3.2 Make sure the following artifact is present before running the model.**
artifact/
└── llama_qnn.pte
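Assuming `${PATH_TO_ARTIFACT}` points at the artifact directory produced by the `--compile_only` step (an assumption about the layout, used here only as a quick check):

```bash
ls -lh "${PATH_TO_ARTIFACT}/llama_qnn.pte"   # pre-generated program the next step will load
```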

-**3.3 Upload model, tokenizer and llama runner binary to phone**
**3.3 Run model**
```bash
-adb push <model.pte> ${DEVICE_DIR}
-adb push <tokenizer.model> ${DEVICE_DIR}
-adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
-adb push cmake-out-android/examples/models/llama/llama_main ${DEVICE_DIR}
-```
-
-**3.4 Run model**
-```bash
-adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"
-```
-You should see the message:
-```
-<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello! I'd be delighted to chat with you about Facebook. Facebook is a social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while he was a student at Harvard University. It was initially called "Facemaker" but later changed to Facebook, which is a combination of the words "face" and "book". The platform was initially intended for people to share their thoughts and share information with their friends, but it quickly grew to become one of the
# Run llama
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s ${SERIAL_NUM} -m ${SOC_MODEL} --checkpoint consolidated.00.pth --params params.json --tokenizer_model tokenizer.model --decoder_model llama3_2-3b_instruct --model_mode kv --max_seq_len 1024 --prompt "I would like to learn python, could you teach me with a simple example?" --tasks wikitext --limit 1 --pre_gen_pte ${PATH_TO_ARTIFACT}
```

## What is coming?
- Performance improvements
- Reduce the memory pressure during inference to support 12GB Qualcomm devices
-- Support more LLMs (Qwen, Phi-4-mini, etc.)
- Broader LLM support via [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch?tab=readme-ov-file#llms-large-language-models)
- Already supported models include Llama2, Llama3, Gemma, Qwen, Phi-4, and SmolLM. For usage examples, please refer to the [README](https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md)

## FAQ

If you encounter any issues while reproducing the tutorial, please file a github
-issue on ExecuTorch repo and tag use `#qcom_aisw` tag
[issue](https://github.com/pytorch/executorch/issues) on the ExecuTorch repo and tag it with the `#qcom_aisw` tag
