Commit 307a295

Gasoonjia authored and facebook-github-bot committed
add readme for gemma3 cuda
Summary: creates a readme for exporting and running gemma3 on cuda backend
Differential Revision: D85220287
1 parent 81a3acc commit 307a295

examples/models/gemma3/README.md

Lines changed: 127 additions & 0 deletions
# Summary

This example demonstrates how to export and run Google's [Gemma 3](https://huggingface.co/google/gemma-3-4b-it) vision-language multimodal model locally on ExecuTorch with CUDA backend support.
# Exporting the model

To export the model, we use [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch), a repo that enables exporting models straight from the source, i.e. Hugging Face's Transformers library.

## Setting up Optimum ExecuTorch

Install the pip package:
```bash
pip install optimum-executorch
```

Or install from source:
```bash
git clone https://github.com/huggingface/optimum-executorch.git
cd optimum-executorch
python install_dev.py
```
## CUDA Support

This guide focuses on CUDA backend support for Gemma3, which provides accelerated performance on NVIDIA GPUs.

### Exporting with CUDA

```bash
optimum-cli export executorch \
  --model "google/gemma-3-4b-it" \
  --task "multimodal-text-to-text" \
  --recipe "cuda" \
  --dtype bfloat16 \
  --device cuda \
  --output_dir="path/to/output/dir"
```

This will generate:
- `model.pte` - The exported model
- `aoti_cuda_blob.ptd` - The CUDA kernel blob required for runtime
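As a quick sanity check (a minimal sketch; `path/to/output/dir` stands in for whatever `--output_dir` you passed), confirm that both artifacts were written:

```bash
# Adjust the path to match the --output_dir used above.
ls -lh path/to/output/dir/model.pte path/to/output/dir/aoti_cuda_blob.ptd
```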
### Exporting with INT4 Quantization (Tile Packed)

For improved performance and a reduced memory footprint, you can export Gemma3 with INT4 weight quantization using a tile-packed format:

```bash
optimum-cli export executorch \
  --model "google/gemma-3-4b-it" \
  --task "multimodal-text-to-text" \
  --recipe "cuda" \
  --dtype bfloat16 \
  --device cuda \
  --qlinear 4w \
  --qlinear_encoder 4w \
  --qlinear_packing_format tile_packed_to_4d \
  --qlinear_encoder_packing_format tile_packed_to_4d \
  --output_dir="path/to/output/dir"
```

This will generate the same files (`model.pte` and `aoti_cuda_blob.ptd`) in the `int4` directory.
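To see the footprint reduction on disk, you can compare the two exports. This is a rough sketch where `gemma3-bf16/` and `gemma3-int4/` are hypothetical stand-ins for the `--output_dir` values you chose:

```bash
# Hypothetical output directories for the bfloat16 and INT4 exports.
du -sh gemma3-bf16/model.pte gemma3-int4/model.pte
```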
See the "Building the Gemma3 runner" section below for instructions on building with CUDA support, and the "Running the model" section for runtime instructions.

# Running the model

To run the model, we will use the Gemma3 runner, which utilizes ExecuTorch's MultiModal runner API.
The Gemma3 runner will do the following:

- **Image Input**: Load image files (PNG, JPG, etc.) and format them as input tensors for the model
- **Text Input**: Process text prompts using the tokenizer
- **Inference**: Feed the formatted inputs to the multimodal runner
## Obtaining the tokenizer

You can download the `tokenizer.json` file from [Gemma 3's HuggingFace repo](https://huggingface.co/unsloth/gemma-3-1b-it):
```bash
curl -L https://huggingface.co/unsloth/gemma-3-1b-it/resolve/main/tokenizer.json -o tokenizer.json
```

## Building the Gemma3 runner

### Prerequisites

Ensure you have a CUDA-capable GPU and CUDA toolkit installed on your system.
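A quick way to verify both prerequisites (assuming the standard NVIDIA tools are on your `PATH`):

```bash
# Check that a CUDA-capable GPU is visible and the CUDA toolkit is installed.
nvidia-smi
nvcc --version
```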
### Building for CUDA

```bash
# Install ExecuTorch.
./install_executorch.sh

# Build the multimodal runner with CUDA
cmake --preset llm \
  -DEXECUTORCH_BUILD_CUDA=ON \
  -DCMAKE_INSTALL_PREFIX=cmake-out \
  -DCMAKE_BUILD_TYPE=Release \
  -Bcmake-out -S.
cmake --build cmake-out -j$(nproc) --target install --config Release

# Build the Gemma3 runner
cmake -DEXECUTORCH_BUILD_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -Sexamples/models/gemma3 \
  -Bcmake-out/examples/models/gemma3/
cmake --build cmake-out/examples/models/gemma3 --target gemma3_e2e_runner --config Release
```
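If the build succeeded, the runner binary invoked in the next section should now exist (a simple sanity check):

```bash
# Confirm the Gemma3 runner binary was produced at the path used below.
ls -lh cmake-out/examples/models/gemma3/gemma3_e2e_runner
```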
## Running the model

You need to provide the following files to run Gemma3:
- `model.pte` - The exported model file
- `aoti_cuda_blob.ptd` - The CUDA kernel blob
- `tokenizer.json` - The tokenizer file
- An image file (PNG, JPG, etc.)

### Example usage

```bash
# Here we use the ExecuTorch logo as the example image.
./cmake-out/examples/models/gemma3/gemma3_e2e_runner \
  --model_path path/to/model.pte \
  --data_path path/to/aoti_cuda_blob.ptd \
  --tokenizer_path path/to/tokenizer.json \
  --image_path docs/source/_static/img/et-logo.png \
  --temperature 0
```
# Example output

```
Okay, let's break down what's in the image!

It appears to be a stylized graphic combining:

* **A Microchip:** The core shape is a representation of a microchip (the integrated circuit).
* **An "On" Symbol:** There's an "On" symbol (often represented as a circle with a vertical line) incorporated into the microchip design.
* **Color Scheme:** The microchip is colored in gray, and
PyTorchObserver {"prompt_tokens":271,"generated_tokens":99,"model_load_start_ms":0,"model_load_end_ms":0,"inference_start_ms":1761118126790,"inference_end_ms":1761118128385,"prompt_eval_end_ms":1761118127175,"first_token_ms":1761118127175,"aggregate_sampling_time_ms":86,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```
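The trailing `PyTorchObserver` line is a JSON record of basic timing stats. As a rough sketch (assuming the runner's output was saved to a file such as `run.log` and `jq` is installed), you can derive a decode throughput from it; with the numbers above, 99 generated tokens over the roughly 1.2 s between `prompt_eval_end_ms` and `inference_end_ms` works out to about 82 tokens/s:

```bash
# Extract the PyTorchObserver JSON and compute tokens generated per second
# of decode time (inference_end_ms - prompt_eval_end_ms).
grep -o 'PyTorchObserver {.*}' run.log \
  | sed 's/^PyTorchObserver //' \
  | jq '.generated_tokens / ((.inference_end_ms - .prompt_eval_end_ms) / 1000)'
```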
