
feat: macOS ASR Support #79

Closed
gaspardpetit opened this issue Jan 7, 2025 · 10 comments · Fixed by #87 or #76
Labels
enhancement New feature or request

Comments

@gaspardpetit
Owner

No description provided.

@gaspardpetit gaspardpetit converted this from a draft issue Jan 7, 2025
@gaspardpetit gaspardpetit changed the title from "MacOS ASR" to "feat: macOS ASR Support" Jan 7, 2025
@gaspardpetit gaspardpetit added the enhancement New feature or request label Jan 7, 2025
@gaspardpetit gaspardpetit linked a pull request Jan 7, 2025 that will close this issue
@gaspardpetit
Owner Author

@linozen #87 should fix some of the issues you were seeing with pywhispercpp. After investigation, the main issues turned out to be the following:

  • Word timestamps were missing the window timestamp offset. As the 30-second window advances, window_ts tracks the window offset; words are stored in absolute time, so when we create them we add window_ts to the offset reported by whisper.cpp.
  • whisper.cpp does not prefix words with a space, as you had observed, so for now the space is added manually. I am not 100% sure what consequences this may have for groups of words that are not separated by a space (e.g. joined by a hyphen or an apostrophe)...
  • whisper.cpp seems to start sequences with an empty word; this is now filtered out.
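
The three fixes above could be sketched roughly like this (a minimal illustration; `Word`, `postprocess_words`, and the tuple shape of the raw words are hypothetical stand-ins, not verbatim's actual API):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_ts: int  # absolute timestamps (window-relative offset + window_ts)
    end_ts: int

def postprocess_words(raw_words, window_ts):
    """Apply the three fixes: drop empty words, prefix a space where
    whisper.cpp omitted it, and shift window-relative timestamps by
    window_ts so they become absolute."""
    words = []
    for text, start, end in raw_words:
        if not text.strip():          # whisper.cpp may emit an empty leading word
            continue
        if not text.startswith(" "):  # whisper.cpp does not prefix words with a space
            text = " " + text
        words.append(Word(text=text, start_ts=start + window_ts, end_ts=end + window_ts))
    return words
```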

The alignment and timestamps of words are important. As we advance the window, we compare against the historical transcription; when we find a common prefix, it is acknowledged and we advance the window further.

It turns out that whisper.cpp's timestamps are... not great. When we advance the window, we also advance our historical transcriptions, and because the timestamps are inaccurate, the words become misaligned. As a result, once the window has advanced, it usually takes two attempts to get a match. By contrast, with faster_whisper or regular whisper, advancing the history and the window typically keeps words aligned, and a single iteration matches against the historical transcription. I am not sure I am making this clear, but essentially: it works, but it is not optimal.
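
The prefix-matching step described here could be sketched as follows (illustrative only, assuming words are compared as plain strings; verbatim's real logic also deals with timestamps):

```python
def confirm_common_prefix(previous, current):
    """Return the words confirmed by two successive transcription
    attempts over the same audio: the longest common prefix of the
    two word sequences. Words that differ remain unconfirmed."""
    confirmed = []
    for prev_word, cur_word in zip(previous, current):
        if prev_word != cur_word:
            break
        confirmed.append(cur_word)
    return confirmed
```

With inaccurate whisper.cpp timestamps, the history is shifted when the window advances, so this comparison often matches nothing on the first attempt after an advance, and only the second attempt confirms words.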

Another difference with whisper.cpp is that, from my tests, it seems to handle multi-language audio better than faster_whisper and openai/whisper. I am used to seeing the transcription stop at a language change, or start translating instead of transcribing. whisper.cpp does a much better job of transcribing the original language even when the wrong language has been requested, and of handling changes of language mid-window.

Let me know how it goes for you.

@github-project-automation github-project-automation bot moved this from In progress to Done in Verbatim Jan 7, 2025
@gaspardpetit gaspardpetit linked a pull request Jan 7, 2025 that will close this issue
@gaspardpetit gaspardpetit moved this from Done to In progress in Verbatim Jan 7, 2025
@gaspardpetit gaspardpetit reopened this Jan 7, 2025
@linozen
Collaborator

linozen commented Jan 7, 2025

Thanks for your work on this. #87 indeed fixes all of the most egregious issues, and your explanations help me understand the logic in the code. I see similar issues to the ones you outlined. I'll give it a shot and see what might be done to improve the accuracy of the timestamps.

The issue in whisper.cpp is known and tracked here:

What I would like to try in a separate PR is using https://github.com/ml-explore/mlx-examples/tree/main/whisper. It supports word-level timestamps, and maybe those work better with verbatim's alignment logic.

@gaspardpetit
Owner Author

Awesome, good luck - I read that HF transformers should also work: openai/whisper#984

I had attempted to use this interface earlier, and I think I ended up struggling with the underlying assumption that audio would be provided continuously, whereas verbatim repeats audio segments multiple times. The first transcription would work, but then you'd have to destroy the pipeline and re-create it, otherwise it would get confused by seeing repeated text. This was immediately observable with the Air France sample in the project, which starts with the same sentence repeated in French and English. The HF pipeline thought it was a repetition and ignored the English "welcome aboard ladies and gentlemen" completely.

I ended up switching to faster_whisper after that, but perhaps the HF transformer can be reset.

@linozen linozen mentioned this issue Jan 7, 2025
@linozen
Collaborator

linozen commented Jan 7, 2025

So, I added the PR. But there are still some outstanding problems with how this interacts with the alignment logic in verbatim.py. See the transcript below:

[00:00:00-00:00:02][en] Madame, Monsieur, bonjour et bienvenue à bord.
[00:00:03-00:00:04] Welcome aboard, ladies and gentlemen.
[00:00:06-00:00:10] For your safety and comfort, please take a moment to watch the following safety video.
[00:00:10-00:00:13] This film is about your safety on board.
[00:00:28-00:00:33] Whenever the seatbelt sign is on, your seatbelt must be securely fastened.
[00:00:34-00:00:39] For your safety, we recommend that you keep your seatbelt fastened and visible at all times while seated.
[00:00:42-00:00:44] To release the seatbelt, just lift the buckle.
  1. Some lines are translated (the 4th line should be French).
  2. There is content missing, e.g. 00:00:13-00:00:28, and the transcripts are often incomplete.

@linozen
Collaborator

linozen commented Jan 7, 2025

Performance is great though 👌

@gaspardpetit
Owner Author

Brilliant - I'll take a look tonight. This behaviour aligns with what I have seen before when whisper is configured with the wrong language (skipping text and translating instead of transcribing). The issue may just be with the language detection logic - I posted a comment in the review: I think for short durations we should return a low probability, so that verbatim retries with longer durations. I think it currently retries when the probability is below 0.5.
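
The retry idea suggested here could be sketched like this (a rough illustration; `guess_language`, the list of durations, and the 0.5 default are assumptions based on the discussion, not verbatim's actual code):

```python
def detect_language(guess_language, audio, start, durations, threshold=0.5):
    """Try language detection on progressively longer windows until the
    reported probability clears the threshold. If short windows report a
    low probability, this loop keeps retrying with more audio.

    guess_language: hypothetical callable taking an audio slice and
    returning (language, probability)."""
    lang, prob = None, 0.0
    for duration in durations:
        lang, prob = guess_language(audio[start:start + duration])
        if prob >= threshold:
            break
    return lang, prob
```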

@gaspardpetit
Owner Author

I tested and can confirm - the speed is quite good! I'm getting comparable speed on the Mac Studio to what I get on the RTX 4070. I pushed a couple of minor adjustments under #90.

Thanks again for this contribution!

@linozen
Collaborator

linozen commented Jan 8, 2025

Great! This is shaping up to be a great cross-platform package for transcription/diarization.

One issue I see, but only at the end of the AirFrance clip when diarising:

2025-01-08T11:46:04Z [INFO][whispermlx.py:54][guess_language] Detected language: en
2025-01-08T11:46:04Z [INFO][whispermlx.py:77][transcribe] Transcribing audio window: window_ts=3537918, audio_ts=3754847
[00:00.000 --> 00:04.640]  We encourage everyone to read the safety information leaflet located in the seat back pocket.
[00:05.220 --> 00:08.600]  Merci pour votre attention. Nous vous souhaitons un bon vol.
[00:09.060 --> 00:12.640]  Thank you for your attention. We wish you a very pleasant flight.
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' We': end_ts=3542398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' encourage': end_ts=3547518, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' everyone': end_ts=3554238, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' to': end_ts=3559038, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' read': end_ts=3562878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' the': end_ts=3565118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' safety': end_ts=3570238, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' information': end_ts=3574078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' leaflet': end_ts=3584638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' located': end_ts=3592958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' in': end_ts=3598078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' the': end_ts=3599358, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' seat': end_ts=3602237, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' back': end_ts=3605118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pocket.': end_ts=3612158, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Merci': end_ts=3630398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pour': end_ts=3634878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' votre': end_ts=3637758, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' attention.': end_ts=3643198, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Nous': end_ts=3655678, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' vous': end_ts=3657918, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' souhaitons': end_ts=3664638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' un': end_ts=3667198, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' bon': end_ts=3670078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' vol.': end_ts=3675518, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Thank': end_ts=3691838, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' you': end_ts=3694398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' for': end_ts=3696958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' your': end_ts=3698878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' attention.': end_ts=3703998, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' We': end_ts=3717118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' wish': end_ts=3720638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' you': end_ts=3722878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' a': end_ts=3724478, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' very': end_ts=3728958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pleasant': end_ts=3735038, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' flight.': end_ts=3740158, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][verbatim.py:610][process_audio_window] [221.119875/14.3320625/13.2780625][00:03:41-00:03:41][en] We encourage
[00:03:41-00:03:53] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.


[00:03:41-00:03:41][SPEAKER_01][en] We encourage
[00:03:41-00:03:53][None] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.
[00:03:41-00:03:53][SPEAKER_01] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.

The last three segments are merged into one and assigned entirely to the English speaker SPEAKER_01. Do you have an intuition as to why that might be? As you can see, the segments were correctly separated by mlx_whisper.

@linozen
Collaborator

linozen commented Jan 13, 2025

Do you have any idea about my last message, @gaspardpetit? Otherwise, we can close this as completed :)

@gaspardpetit
Owner Author

I believe this is an issue with the end-of-transcript logic. Throughout the transcription, verbatim adds audio little by little and keeps a history of transcriptions from previous attempts. When two attempts match perfectly, the words are confirmed. When enough confirmed words form a complete utterance (e.g. a sentence, or a logical fragment of one), the utterance is acknowledged and we advance the window. At the very end, there are generally leftovers - words that have not yet been confirmed or acknowledged. There is separate logic to flush them, but this logic may not handle language/diarization correctly.
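
A fix for the suspected flush bug could look something like this sketch: instead of merging all leftover words into one segment, split them into utterances whenever the language or speaker changes (the word tuples and function name are hypothetical, not verbatim's actual data model):

```python
def flush_leftovers(leftover_words):
    """Flush end-of-transcript leftovers as separate utterances, starting
    a new utterance whenever the language or speaker changes instead of
    merging everything into one segment.

    leftover_words: list of (text, language, speaker) tuples."""
    utterances = []
    current = []
    for word in leftover_words:
        if current and (word[1] != current[-1][1] or word[2] != current[-1][2]):
            utterances.append(current)  # language or speaker changed: close utterance
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    return utterances
```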

This kind of issue should become more obvious once we start benchmarking against ground truths - I suggest we keep moving forward and re-open a bug when this is reproduced by the test framework/metrics you've been working on.
