Commit db8eecd

Add Metal backend documentation to Voxtral README
1 parent 7d4db45

1 file changed

examples/models/voxtral/README.md

Lines changed: 62 additions & 7 deletions
@@ -74,6 +74,26 @@ optimum-cli export executorch \

 See the "Building the multimodal runner" section below for instructions on building with CUDA support, and the "Running the model" section for runtime instructions.

+## Metal Support
+On Apple Silicon, you can build and run the runner on Metal. Follow the export and runtime commands below:
+
+### Exporting with Metal
+```
+optimum-cli export executorch \
+  --model "mistralai/Voxtral-Mini-3B-2507" \
+  --task "multimodal-text-to-text" \
+  --recipe "metal" \
+  --dtype bfloat16 \
+  --max_seq_len 1024 \
+  --output_dir="voxtral"
+```
+
+This will generate:
+- `model.pte` - the exported model
+- `aoti_metal_blob.ptd` - the Metal kernel blob required at runtime
+
+See the "Building the multimodal runner" section below for instructions on building with Metal support, and the "Running the model" section for runtime instructions.
+
 # Running the model
 To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
 The Voxtral runner will do the following things:
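A quick way to sanity-check the export artifacts before moving on to the build steps; a minimal sketch, not part of the diff (`voxtral` is the `--output_dir` used above, and the size printout is purely illustrative):

```
from pathlib import Path

# Artifacts produced by the Metal export step above.
out = Path("voxtral")
for name in ("model.pte", "aoti_metal_blob.ptd"):
    p = out / name
    assert p.is_file(), f"missing export artifact: {p}"
    print(f"{p}: {p.stat().st_size / 1e6:.1f} MB")
```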
@@ -90,7 +110,12 @@ We provide a simple way to transform raw audio data into a mel spectrogram by ex

 ```
 # Export a preprocessor that can handle audio up to 5 mins (300s).
-python -m executorch.extension.audio.mel_spectrogram --feature_size 128 --stack_output --max_audio_len 300 --output_file voxtral_preprocessor.pte
+
+python -m executorch.extension.audio.mel_spectrogram \
+  --feature_size 128 \
+  --stack_output \
+  --max_audio_len 300 \
+  --output_file voxtral_preprocessor.pte
 ```

 ## Building the multimodal runner
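Before building the runner, the exported preprocessor can be smoke-tested from Python. A minimal sketch, assuming ExecuTorch's pybindings are installed and that the preprocessor takes a 1-D float32 waveform at 16 kHz; the module path and input shape are assumptions, not part of the diff:

```
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Load the preprocessor exported by the command above.
module = _load_for_executorch("voxtral_preprocessor.pte")

# Hypothetical input: 10 seconds of silence as a 1-D float32 waveform at 16 kHz.
waveform = torch.zeros(16000 * 10, dtype=torch.float32)

# forward() takes a sequence of inputs and returns a list of outputs.
outputs = module.forward([waveform])
print(outputs[0].shape)  # the mel spectrogram tensor
```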
@@ -124,6 +149,26 @@ cmake -DEXECUTORCH_BUILD_CUDA=ON \
 cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
 ```

+### Building for Metal
+```
+# Install ExecuTorch with Metal support
+CMAKE_ARGS="-DEXECUTORCH_BUILD_METAL=ON" ./install_executorch.sh
+
+# Build the multimodal runner with Metal
+cmake --preset llm \
+  -DEXECUTORCH_BUILD_METAL=ON \
+  -DCMAKE_INSTALL_PREFIX=cmake-out \
+  -DCMAKE_BUILD_TYPE=Release \
+  -Bcmake-out -S.
+cmake --build cmake-out -j16 --target install --config Release
+
+cmake -DEXECUTORCH_BUILD_METAL=ON \
+  -DCMAKE_BUILD_TYPE=Release \
+  -Sexamples/models/voxtral \
+  -Bcmake-out/examples/models/voxtral/
+cmake --build cmake-out/examples/models/voxtral --target voxtral_runner --config Release
+```
+
 ## Running the model
 You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).

@@ -148,13 +193,12 @@ If you already have a preprocessed mel spectrogram saved as a `.bin` file, you c
 --audio_path path/to/preprocessed_audio.bin
 ```

+### Running on CUDA or Metal
+Add the `--data_path` argument to the commands above to point at the matching data blob:
+- For CUDA: `--data_path path/to/aoti_cuda_blob.ptd`
+- For Metal: `--data_path path/to/aoti_metal_blob.ptd`

-**For CUDA:** Add the `--data_path` argument to provide the CUDA kernel blob to the commands above:
-```
---data_path path/to/aoti_cuda_blob.ptd
-```
-
-Example output:
+# Example output:
 ```
 The speaker in this audio seems to be talking about their concerns about a device called the model or maybe they're just talking about the model in general. They mention that the model was trained with the speaker for inference, which suggests that
 the model was trained based on the speaker's data or instructions. They also mention that the volume is quite small, which could imply that the speaker is trying to control the volume of the model's output, likely because they are concerned about how loud the model's responses might
@@ -168,6 +212,7 @@ I 00:00:24.036822 executorch:stats.h:147] Time to first generated token:
 I 00:00:24.036828 executorch:stats.h:153] Sampling time over 487 tokens: 0.099000 (seconds)
 ```

+# Generating audio input
 You can easily produce a `.bin` file for the audio input in Python like this:
 ```
 # t = some torch.Tensor
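The Python snippet above is cut off at the hunk boundary. As a hedged sketch of the idea, one way to dump a tensor as raw float32 bytes (little-endian on typical hosts, matching the `f32le` layout the ffmpeg recipe below produces; the tensor contents and file name here are placeholders):

```
import torch

# t = some torch.Tensor holding the audio (placeholder: 5 s of random samples at 16 kHz)
t = torch.randn(16000 * 5)

# Write raw float32 bytes, matching the f32le layout used by the ffmpeg command below.
t.contiguous().numpy().astype("float32").tofile("audio_input.bin")
```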
@@ -180,3 +225,13 @@ You can also produce a raw audio file as follows (for Option A):
 ```
 ffmpeg -i audio.mp3 -f f32le -acodec pcm_f32le -ar 16000 audio_input.bin
 ```
+
+### Generating a .wav file on Mac
+On macOS, you can use the built-in `say` command to generate speech audio and convert it to a `.wav` file:
+```
+# Generate audio using text-to-speech
+say -o call_samantha_hall.aiff "Call Samantha Hall"
+
+# Convert to .wav format
+afconvert -f WAVE -d LEI16 call_samantha_hall.aiff call_samantha_hall.wav
+```
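To feed a `.wav` like the one above to the runner, it still needs to become a raw float32 `.bin` at 16 kHz (per the ffmpeg flags earlier). A minimal sketch, assuming `torchaudio` is available; the file names follow the commands above:

```
import torchaudio

# Load the .wav produced above; returns a [channels, frames] tensor and the sample rate.
waveform, sample_rate = torchaudio.load("call_samantha_hall.wav")
waveform = waveform.mean(dim=0)  # downmix to mono

# Resample to the 16 kHz used in the ffmpeg recipe above, if needed.
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Dump raw float32 samples for the runner's --audio_path input.
waveform.contiguous().numpy().astype("float32").tofile("call_samantha_hall.bin")
```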
