See the "Building the multimodal runner" section below for instructions on building with CUDA support, and the "Running the model" section for runtime instructions.
## Metal Support
On Apple Silicon, you can run the model on Metal. Follow the export and runtime commands below:
### Exporting with Metal
```
optimum-cli export executorch \
--model "mistralai/Voxtral-Mini-3B-2507" \
--task "multimodal-text-to-text" \
--recipe "metal" \
--dtype bfloat16 \
--max_seq_len 1024 \
--output_dir="voxtral"
```
This will generate:
- `model.pte` - The exported model
- `aoti_metal_blob.ptd` - The Metal kernel blob required for runtime

See the "Building the multimodal runner" section below for instructions on building with Metal support, and the "Running the model" section for runtime instructions.
# Running the model
To run the model, we will use the Voxtral runner, which utilizes ExecuTorch's MultiModal runner API.
The Voxtral runner will do the following things:

We provide a simple way to transform raw audio data into a mel spectrogram by exporting an audio preprocessor:

```
# Export a preprocessor that can handle audio up to 5 mins (300s).
```
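If you prefer to compute a log-mel spectrogram directly in Python (for example, to produce the preprocessed `.bin` input described below), a minimal sketch using torchaudio might look like this. The sample rate, FFT size, hop length, and number of mel bins are assumptions, not values taken from this guide, and must match what the Voxtral audio encoder expects:

```
import torch
import torchaudio

# Assumed Whisper-style front-end parameters; verify against the Voxtral encoder.
SAMPLE_RATE = 16_000
N_FFT = 400
HOP_LENGTH = 160
N_MELS = 128

# Load an audio file and downmix to mono at the assumed sample rate.
waveform, sr = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0, keepdim=True)
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

# Compute a log-mel spectrogram of shape (1, N_MELS, num_frames).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
)(waveform)
log_mel = torch.log(mel.clamp(min=1e-10))

# Optionally dump the raw tensor bytes for use with --audio_path (see below).
log_mel.to(torch.float32).contiguous().numpy().tofile("preprocessed_audio.bin")
```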
You can download the `tekken.json` tokenizer from [Voxtral's HuggingFace repo](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507).

If you already have a preprocessed mel spectrogram saved as a `.bin` file, you can pass it to the runner directly:

```
--audio_path path/to/preprocessed_audio.bin
```
### Running on CUDA or Metal:
Add the `--data_path` argument to provide the appropriate data blob to the commands above:
- For CUDA: `--data_path path/to/aoti_cuda_blob.ptd`
- For Metal: `--data_path path/to/aoti_metal_blob.ptd`
# Example output:
```
The speaker in this audio seems to be talking about their concerns about a device called the model or maybe they're just talking about the model in general. They mention that the model was trained with the speaker for inference, which suggests that
the model was trained based on the speaker's data or instructions. They also mention that the volume is quite small, which could imply that the speaker is trying to control the volume of the model's output, likely because they are concerned about how loud the model's responses might

I 00:00:24.036822 executorch:stats.h:147] Time to first generated token:
I 00:00:24.036828 executorch:stats.h:153] Sampling time over 487 tokens: 0.099000 (seconds)
```
# Generating audio input
You can easily produce a `.bin` file for the audio input in Python like this:
```
# t = some torch.Tensor
# One possible way to dump the raw tensor bytes (the filename and dtype are
# placeholders; match whatever the runner expects for --audio_path):
t.contiguous().numpy().tofile("audio_input.bin")
```
You can also produce a raw audio file as follows (for Option A):
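As a rough, illustrative sketch (not the exact snippet used elsewhere in this guide), one way to write a mono waveform out as a WAV file with torchaudio is shown below; whether Option A expects a WAV file or raw PCM bytes, and the 16 kHz sample rate, are assumptions to verify against the runner:

```
import torch
import torchaudio

SAMPLE_RATE = 16_000  # assumed; match what the audio front end expects

# waveform = some torch.Tensor of shape (channels, num_samples), values in [-1, 1]
waveform = torch.zeros(1, SAMPLE_RATE * 5)  # 5 seconds of silence as a placeholder

# Write the waveform out as a WAV file.
torchaudio.save("audio.wav", waveform, SAMPLE_RATE)
```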