171 changes: 171 additions & 0 deletions docs/source/llm/export-llm-optimum.md
@@ -0,0 +1,171 @@
# Exporting LLMs with HuggingFace's Optimum ExecuTorch

[Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) provides a streamlined way to export Hugging Face transformer models to ExecuTorch format. It offers seamless integration with the Hugging Face ecosystem, making it easy to export models directly from the Hugging Face Hub.

## Overview
**Review comment (Contributor):** Please add here, or somewhere obvious, that optimum-executorch is undergoing active development, so proceed with caution.

**Reply (Author):** I think it is stable enough not to need this; additionally, I don't see what the benefit of saying it would be.


Optimum ExecuTorch supports a much wider variety of model architectures compared to ExecuTorch's native `export_llm` API. While `export_llm` focuses on a limited set of highly optimized models (Llama, Qwen, Phi, and SmolLM) with advanced features like SpinQuant and attention sink, Optimum ExecuTorch can export diverse architectures including Gemma, Mistral, GPT-2, BERT, T5, Whisper, Voxtral, and many others.

### Use Optimum ExecuTorch when:
- You need to export models beyond the limited set supported by `export_llm`
- You want to export directly from Hugging Face Hub model IDs, including model variants such as fine-tunes
- You want a simpler interface with Hugging Face ecosystem integration

### Use export_llm when:
- You're working with one of the highly optimized supported models (Llama, Qwen, Phi, SmolLM)
- You need advanced optimizations like SpinQuant or attention sink
- You need pt2e quantization for the QNN/CoreML/Vulkan backends
- You're working with Llama models that require custom checkpoints

See [Exporting LLMs](export-llm.md) for details on using the native `export_llm` API.

## Prerequisites

### Installation

First, clone and install Optimum ExecuTorch from source:

```bash
git clone https://github.com/huggingface/optimum-executorch.git
cd optimum-executorch
pip install '.[dev]'
```

For access to the latest features and optimizations, install dependencies in dev mode:

```bash
python install_dev.py
```

This installs `executorch`, `torch`, `torchao`, `transformers`, and other dependencies from nightly builds or source.
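If you want a quick, optional sanity check that the key packages resolved correctly after installation, the imports below should all succeed (this snippet only checks importability and prints a few version strings):

```python
# Optional post-install sanity check: these imports should all succeed.
import torch
import torchao
import transformers
import optimum.executorch  # provided by the optimum-executorch package

print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
```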
**Review comment (Contributor):** I don't like that it re-installs executorch, torch, and torchao. Can we provide an option to build in no-isolation mode?

**Reply (Author):** Will make a separate PR.


## Supported Models

Optimum ExecuTorch supports a wide range of model architectures including decoder-only LLMs (Llama, Qwen, Gemma, Mistral, etc.), multimodal models, vision models, audio models (Whisper), encoder models (BERT, RoBERTa), and seq2seq models (T5).

For the complete list of supported models, see the [Optimum ExecuTorch documentation](https://github.com/huggingface/optimum-executorch#-supported-models).

## Export Methods

Optimum ExecuTorch offers two ways to export models:

### Method 1: CLI Export

The CLI is the simplest way to export models. It provides a single command to convert models from Hugging Face Hub to ExecuTorch format.

#### Basic Export

```bash
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --output_dir="./smollm2_exported"
```

#### With Optimizations

Add custom SDPA, KV cache optimization, and quantization:

```bash
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --use_custom_sdpa \
    --use_custom_kv_cache \
    --qlinear 8da4w \
    --qembedding 8w \
    --output_dir="./smollm2_exported"
```

#### Available CLI Arguments

Key arguments for LLM export include `--model`, `--task`, `--recipe` (backend), `--use_custom_sdpa`, `--use_custom_kv_cache`, `--qlinear` (linear quantization), `--qembedding` (embedding quantization), and `--max_seq_len`.

For the complete list of arguments, run:
```bash
optimum-cli export executorch --help
```
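
### Method 2: Python API Export

The second way is to export programmatically while loading a checkpoint from the Hugging Face Hub. The sketch below is a minimal example: `ExecuTorchModelForCausalLM` and `text_generation` are the same APIs used in the verification section later in this guide, while the `recipe` keyword argument (selecting the backend) follows the Optimum ExecuTorch README and should be verified against the version you have installed.

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

# Export happens when a Hub checkpoint is loaded with an ExecuTorch recipe.
# NOTE: the `recipe` keyword follows the Optimum ExecuTorch README; confirm
# the exact argument name for your installed version.
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
print(model.text_generation(tokenizer=tokenizer, prompt="Once upon a time", max_seq_len=64))
```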

## Optimization Options

### Custom Operators

Optimum ExecuTorch includes custom SDPA (~3x speedup) and custom KV cache (~2.5x speedup) operators. Enable them with `--use_custom_sdpa` and `--use_custom_kv_cache`.
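
If you export through the Python API instead of the CLI, the same optimizations are exposed as keyword arguments to `from_pretrained`. The argument names below are assumptions taken from the Optimum ExecuTorch README; check them against the version you have installed:

```python
from optimum.executorch import ExecuTorchModelForCausalLM

# Assumed keyword equivalents of --use_custom_sdpa / --use_custom_kv_cache;
# verify the exact names against the Optimum ExecuTorch documentation.
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
    attn_implementation="custom_sdpa",  # custom SDPA operator (~3x faster attention)
    use_custom_kv_cache=True,           # custom KV cache operator (~2.5x speedup)
)
```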

### Quantization

Optimum ExecuTorch uses [TorchAO](https://github.com/pytorch/ao) for quantization. Common options:
- `--qlinear 8da4w`: int8 dynamic activation + int4 weight (recommended)
- `--qembedding 4w` or `--qembedding 8w`: int4/int8 embedding quantization

Example:
```bash
optimum-cli export executorch \
    --model "meta-llama/Llama-3.2-1B" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --use_custom_sdpa \
    --use_custom_kv_cache \
    --qlinear 8da4w \
    --qembedding 4w \
    --output_dir="./llama32_1b"
```

### Backend Support

Supported backends: `xnnpack` (CPU), `coreml` (Apple GPU), `portable` (baseline), `cuda` (NVIDIA GPU). Specify with `--recipe`.

## Exporting Different Model Types

Optimum ExecuTorch supports various model architectures with different tasks:

- **Decoder-only LLMs**: Use `--task text-generation`
- **Multimodal LLMs**: Use `--task multimodal-text-to-text`
- **Seq2Seq models** (T5): Use `--task text2text-generation`
- **ASR models** (Whisper): Use `--task automatic-speech-recognition`

For detailed examples of exporting each model type, see the [Optimum ExecuTorch export guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md).

## Running Exported Models

### Verifying Output with Python

After exporting, you can verify the model's output in Python before deploying it to a device. Use the model classes defined in `modeling.py`, such as `ExecuTorchModelForCausalLM` for LLMs:

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

# Load the exported model
model = ExecuTorchModelForCausalLM.from_pretrained("./smollm2_exported")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Generate text
generated_text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Once upon a time",
    max_seq_len=128,
)
print(generated_text)
```

### Running on Device

After verifying your model works correctly, deploy it to device:

- [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime

**Review comment (Contributor):** I think you can do .html

**Reply (Author):** What do you mean?

- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices
- [Running on iOS](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) - Deploy to iOS devices

## Performance

For performance benchmarks and on-device metrics, see the [Optimum ExecuTorch benchmarks](https://github.com/huggingface/optimum-executorch#-benchmarks-on-mobile-devices) and the [ExecuTorch Benchmark Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fexecutorch).

## Additional Resources

- [Optimum ExecuTorch GitHub](https://github.com/huggingface/optimum-executorch) - Full documentation and examples
- [Supported Models](https://github.com/huggingface/optimum-executorch#-supported-models) - Complete model list
- [Export Guide](https://github.com/huggingface/optimum-executorch/blob/main/optimum/exporters/executorch/README.md) - Detailed export examples
- [TorchAO Quantization](https://github.com/pytorch/ao) - Quantization library documentation
2 changes: 2 additions & 0 deletions docs/source/llm/export-llm.md
@@ -20,6 +20,8 @@ As of this doc, the list of supported LLMs include the following:

The up-to-date list of supported LLMs can be found in the code [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32).

**Note:** If you need to export models that are not on this list, or other model architectures (such as Gemma, Mistral, BERT, T5, or Whisper), see [Exporting LLMs with Optimum](export-llm-optimum.md), which supports a much wider variety of models from the Hugging Face Hub.

## The export_llm API
`export_llm` is ExecuTorch's high-level export API for LLMs. In this tutorial, we will focus on exporting Llama 3.2 1B using this API. `export_llm`'s arguments are specified either through CLI args or through a yaml configuration whose fields are defined in [`LlmConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py). To call `export_llm`:

6 changes: 5 additions & 1 deletion docs/source/llm/getting-started.md
@@ -18,8 +18,12 @@ To follow this guide, you'll need to install ExecuTorch. Please see [Setting Up

Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) exporting the LLM to a `.pte` file and (2) running the `.pte` file using our C++ APIs or Swift/Java bindings.

- [Exporting LLMs](export-llm.md)
### Exporting
- [Exporting LLMs](export-llm.md) - Export using ExecuTorch's native `export_llm` API with advanced optimizations
- [Exporting LLMs with Optimum](export-llm-optimum.md) - Export Hugging Face models with broader architecture support
- [Exporting custom LLMs](export-custom-llm.md)

### Running
- [Running with C++](run-with-c-plus-plus.md)
- [Running on Android (XNNPack)](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android)
- [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md)
1 change: 1 addition & 0 deletions docs/source/llm/working-with-llms.md
@@ -11,6 +11,7 @@ Learn how to export LLM models and deploy them across different platforms and ru

getting-started
export-llm
export-llm-optimum
export-custom-llm
run-with-c-plus-plus
build-run-llama3-qualcomm-ai-engine-direct-backend