
Commit 9152f0a

Export LLMs with Optimum docs (#15062)
1 parent d2aa141 commit 9152f0a

File tree: 4 files changed, +179 -1 lines changed

docs/source/llm/export-llm-optimum.md
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# Exporting LLMs with Hugging Face's Optimum ExecuTorch

[Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) provides a streamlined way to export Hugging Face transformer models to the ExecuTorch format. It offers seamless integration with the Hugging Face ecosystem, making it easy to export models directly from the Hugging Face Hub.

## Overview

Optimum ExecuTorch supports a much wider variety of model architectures than ExecuTorch's native `export_llm` API. While `export_llm` focuses on a limited set of highly optimized models (Llama, Qwen, Phi, and SmolLM) with advanced features like SpinQuant and attention sink, Optimum ExecuTorch can export diverse architectures including Gemma, Mistral, GPT-2, BERT, T5, Whisper, Voxtral, and many others.

### Use Optimum ExecuTorch when:

- You need to export models beyond the limited set supported by `export_llm`
- You are exporting directly from Hugging Face Hub model IDs, including model variants such as finetunes
- You want a simpler interface with Hugging Face ecosystem integration

### Use export_llm when:

- You are working with one of the highly optimized supported models (Llama, Qwen, Phi, SmolLM)
- You need advanced optimizations like SpinQuant or attention sink
- You need pt2e quantization for the QNN/CoreML/Vulkan backends
- You are working with Llama models that require custom checkpoints

See [Exporting LLMs](export-llm.md) for details on using the native `export_llm` API.
## Prerequisites
23+
24+
### Installation
25+
26+
First, clone and install Optimum ExecuTorch from source:
27+
28+
```bash
29+
git clone https://github.com/huggingface/optimum-executorch.git
30+
cd optimum-executorch
31+
pip install '.[dev]'
32+
```
33+
34+
For access to the latest features and optimizations, install dependencies in dev mode:
35+
36+
```bash
37+
python install_dev.py
38+
```
39+
40+
This installs `executorch`, `torch`, `torchao`, `transformers`, and other dependencies from nightly builds or source.
41+
42+
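As a quick sanity check that the environment is ready before exporting, importing the main packages is usually enough (this check is illustrative, not part of the official setup):

```bash
# Should exit silently if optimum-executorch and its dependencies are importable
python -c "import executorch, torchao, transformers, optimum.executorch"
```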
## Supported Models

Optimum ExecuTorch supports a wide range of model architectures including decoder-only LLMs (Llama, Qwen, Gemma, Mistral, etc.), multimodal models, vision models, audio models (Whisper), encoder models (BERT, RoBERTa), and seq2seq models (T5).

For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models).
## Export Methods

Optimum ExecuTorch offers two ways to export models: the `optimum-cli` command-line tool, covered below, and a programmatic Python API built around `from_pretrained`.

### Method 1: CLI Export

The CLI is the simplest way to export models. It provides a single command to convert models from the Hugging Face Hub to ExecuTorch format.

#### Basic Export

```bash
optimum-cli export executorch \
  --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --output_dir="./smollm2_exported"
```
#### With Optimizations

Add custom SDPA, KV cache optimization, and quantization:

```bash
optimum-cli export executorch \
  --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear 8da4w \
  --qembedding 8w \
  --output_dir="./smollm2_exported"
```

#### Available CLI Arguments

Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`.

For the complete list of arguments, run:

```bash
optimum-cli export executorch --help
```
## Optimization Options

### Custom Operators

Optimum ExecuTorch includes custom SDPA (~3x speedup) and custom KV cache (~2.5x speedup) operators. Enable them with `--use_custom_sdpa` and `--use_custom_kv_cache`.

### Quantization

Optimum ExecuTorch uses [TorchAO](https://github.com/pytorch/ao) for quantization. Common options:

- `--qlinear 8da4w`: int8 dynamic activation + int4 weight quantization (recommended)
- `--qembedding 4w` or `--qembedding 8w`: int4/int8 embedding quantization

Example:

```bash
optimum-cli export executorch \
  --model "meta-llama/Llama-3.2-1B" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear 8da4w \
  --qembedding 4w \
  --output_dir="./llama32_1b"
```

### Backend Support

Supported backends: `xnnpack` (CPU), `coreml` (Apple GPU), `portable` (baseline), `cuda` (NVIDIA GPU). Specify the backend with `--recipe`.
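As an illustration of switching backends, the basic SmolLM2 export from above can be retargeted to CoreML by changing only the recipe (the output directory name here is just an example):

```bash
# Same export as the Basic Export example, lowered for the CoreML backend instead of XNNPACK
optimum-cli export executorch \
  --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
  --task "text-generation" \
  --recipe "coreml" \
  --output_dir="./smollm2_coreml"
```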
## Exporting Different Model Types

Optimum ExecuTorch supports various model architectures with different tasks:

- **Decoder-only LLMs**: Use `--task text-generation`
- **Multimodal LLMs**: Use `--task multimodal-text-to-text`
- **Seq2Seq models** (T5): Use `--task text2text-generation`
- **ASR models** (Whisper): Use `--task automatic-speech-recognition`
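For instance, an ASR export follows the same CLI pattern with a different task; the model ID and output directory below are illustrative:

```bash
# Example Whisper export; adjust the model ID, recipe, and options to your needs
optimum-cli export executorch \
  --model "openai/whisper-tiny" \
  --task "automatic-speech-recognition" \
  --recipe "xnnpack" \
  --output_dir="./whisper_exported"
```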
For detailed examples of exporting each model type, see the [Optimum ExecuTorch export guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md).
## Running Exported Models

### Verifying Output with Python

After exporting, you can verify the model's output in Python before deploying to device, using the classes in `modeling.py` such as `ExecuTorchModelForCausalLM` for LLMs:

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

# Load the exported model
model = ExecuTorchModelForCausalLM.from_pretrained("./smollm2_exported")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Generate text
generated_text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Once upon a time",
    max_seq_len=128,
)
print(generated_text)
```
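The same `from_pretrained` interface can also export directly from a Hub model ID, which is the programmatic export path mentioned under Export Methods. A minimal sketch, assuming the `recipe` argument behaves as described in the Optimum ExecuTorch README:

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

# Export on the fly from the Hub (sketch; recipe argument as documented upstream)
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
print(model.text_generation(tokenizer=tokenizer, prompt="Hello", max_seq_len=32))
```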
### Running on Device

After verifying that your model works correctly, deploy it to device:

- [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime
- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices
- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices
## Performance

For performance benchmarks and on-device metrics, see the [Optimum ExecuTorch benchmarks](https://github.com/huggingface/optimum-executorch#-benchmarks-on-mobile-devices) and the [ExecuTorch Benchmark Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fexecutorch).

## Additional Resources

- [Optimum ExecuTorch GitHub](https://github.com/huggingface/optimum-executorch) - Full documentation and examples
- [Supported Models](https://github.com/huggingface/optimum-executorch#-supported-models) - Complete model list
- [Export Guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md) - Detailed export examples
- [TorchAO Quantization](https://github.com/pytorch/ao) - Quantization library documentation

docs/source/llm/export-llm.md
Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,8 @@ As of this doc, the list of supported LLMs include the following:

 The up-to-date list of supported LLMs can be found in the code [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32).

+**Note:** If you need to export models that are not on this list or other model architectures (such as Gemma, Mistral, BERT, T5, Whisper, etc.), see [Exporting LLMs with Optimum](export-llm-optimum.md) which supports a much wider variety of models from Hugging Face Hub.
+
 ## The export_llm API
 `export_llm` is ExecuTorch's high-level export API for LLMs. In this tutorial, we will focus on exporting Llama 3.2 1B using this API. `export_llm`'s arguments are specified either through CLI args or through a yaml configuration whose fields are defined in [`LlmConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py). To call `export_llm`:

docs/source/llm/getting-started.md
Lines changed: 5 additions & 1 deletion
@@ -18,8 +18,12 @@ To follow this guide, you'll need to install ExecuTorch. Please see [Setting Up

 Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) exporting the LLM to a `.pte` file and (2) running the `.pte` file using our C++ APIs or Swift/Java bindings.

-- [Exporting LLMs](export-llm.md)
+### Exporting
+- [Exporting LLMs](export-llm.md) - Export using ExecuTorch's native `export_llm` API with advanced optimizations
+- [Exporting LLMs with Optimum](export-llm-optimum.md) - Export Hugging Face models with broader architecture support
 - [Exporting custom LLMs](export-custom-llm.md)
+
+### Running
 - [Running with C++](run-with-c-plus-plus.md)
 - [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)
 - [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md)

docs/source/llm/working-with-llms.md
Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Learn how to export LLM models and deploy them across different platforms and ru

 getting-started
 export-llm
+export-llm-optimum
 export-custom-llm
 run-with-c-plus-plus
 build-run-llama3-qualcomm-ai-engine-direct-backend
