feat: macOS ASR Support #79
@linozen #87 should fix some of the issues you were seeing with pywhispercpp. After investigation, the main issues turned out to be the following:

The alignment and timestamps of words are important. As we advance the window, we look at the historical transcription, and when we find a common prefix, it is acknowledged and we advance the window further. It turns out that the whisper.cpp timestamps are... not great. So when we advance the window, we also advance our historical transcriptions, and because the timestamps are inaccurate, the words become misaligned. As a result, once the window has advanced, it usually takes two attempts to get a match. By contrast, with faster_whisper or regular whisper, advancing the history and the window typically keeps words aligned, and a single iteration matches something in the historical transcription. I am not sure I am making this clear, but essentially: it should work, but it is not optimal.

Another difference with whisper.cpp is that, from my tests, it seems to handle multiple languages better than faster_whisper and openai/whisper. I am used to seeing the transcription stop at a language change, or start translating instead of transcribing. whisper.cpp seems to do a much better job of transcribing the original language even when the wrong language has been requested, and of handling changes of language mid-window.

Let me know how it goes for you.
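For context, the confirmation step I am describing works roughly like this (a hypothetical sketch, not verbatim's actual code): two consecutive passes over the overlapping window are compared word by word, and only the common prefix is confirmed. Inaccurate timestamps shift one word list relative to the other, which is why the prefix match tends to fail on the first attempt after the window advances.

```python
def common_prefix(prev_words, new_words):
    """Return the run of words on which two consecutive transcription
    passes agree (illustrative sketch of the confirmation step)."""
    confirmed = []
    for prev, new in zip(prev_words, new_words):
        # Compare case-insensitively so trivial differences don't
        # prevent a match.
        if prev.strip().lower() != new.strip().lower():
            break
        confirmed.append(new)
    return confirmed
```

If the two lists are shifted by even one word, the loop breaks immediately and nothing is confirmed until a later pass realigns them.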
Thanks for your work on this. #87 indeed fixes all of the most egregious issues, and your explanations help me understand the logic in the code. I see similar issues to the ones you outlined. I'll give it a shot and see what might be done about improving the accuracy of the timestamps. The issue in whisper.cpp is known and tracked here:
What I would like to try in a separate PR is using https://github.com/ml-explore/mlx-examples/tree/main/whisper. It supports word-level timestamps, and maybe those will work better with verbatim's alignment logic.
Awesome, good luck! I read that the HF transformers pipeline should also work: openai/whisper#984. I had attempted to use this interface earlier, and I think I ended up struggling with its underlying assumption that audio would be provided continuously, whereas in verbatim we repeat audio segments multiple times. The first transcription would work, and then you'd have to destroy the pipeline and re-create it, otherwise it would get confused from seeing repeated text. This was immediately observable with the Air France sample in the project, which starts with a French and English repetition of the same sentence. The HF pipeline thought it was a repetition and would ignore the English "welcome aboard ladies and gentlemen" completely. I ended up switching to
So, I added the PR. But there are still some outstanding problems with how this interacts with the alignment logic:

[00:00:00-00:00:02][en] Madame, Monsieur, bonjour et bienvenue à bord.
[00:00:03-00:00:04] Welcome aboard, ladies and gentlemen.
[00:00:06-00:00:10] For your safety and comfort, please take a moment to watch the following safety video.
[00:00:10-00:00:13] This film is about your safety on board.
[00:00:28-00:00:33] Whenever the seatbelt sign is on, your seatbelt must be securely fastened.
[00:00:34-00:00:39] For your safety, we recommend that you keep your seatbelt fastened and visible at all times while seated.
[00:00:42-00:00:44] To release the seatbelt, just lift the buckle.
Performance is great though 👌
Brilliant - I'll take a look tonight. This behaviour aligns with what I have seen before when whisper is configured with the wrong language (skipping text and translating instead of transcribing). The issue may just be in the language detection logic. I posted a comment in the review: for short durations we should return a low probability so that verbatim retries with longer durations. I think it currently retries when the probability is below 0.5.
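The retry idea could look roughly like this (a sketch only: `detect` stands in for whatever language-detection call the backend exposes, the durations are illustrative, and 0.5 is the threshold I believe verbatim uses):

```python
def guess_language_with_retry(detect, audio, sample_rate=16000,
                              durations=(2, 5, 10), min_prob=0.5):
    """Retry language detection on progressively longer audio prefixes
    until the reported probability clears the threshold.

    `detect(samples) -> (lang, prob)` is a placeholder for the
    backend's language-detection call.
    """
    lang, prob = None, 0.0
    for seconds in durations:
        lang, prob = detect(audio[: seconds * sample_rate])
        if prob >= min_prob:
            break
    # If no prefix clears the threshold, return the last (best-effort) guess.
    return lang, prob
```

The point is that a short prefix should report low confidence rather than a confident wrong answer, so the caller keeps retrying instead of locking in the wrong language.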
I tested and confirm - the speed is quite good! I'm getting comparable speed on the Mac Studio as I get on the RTX 4070. I pushed a couple of minor adjustments under #90. Thanks again for this contribution!
Great! This is shaping up to be a great cross-platform package for transcription/diarization. One issue I see, only at the end of the AirFrance clip when diarising:

2025-01-08T11:46:04Z [INFO][whispermlx.py:54][guess_language] Detected language: en
2025-01-08T11:46:04Z [INFO][whispermlx.py:77][transcribe] Transcribing audio window: window_ts=3537918, audio_ts=3754847
[00:00.000 --> 00:04.640] We encourage everyone to read the safety information leaflet located in the seat back pocket.
[00:05.220 --> 00:08.600] Merci pour votre attention. Nous vous souhaitons un bon vol.
[00:09.060 --> 00:12.640] Thank you for your attention. We wish you a very pleasant flight.
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' We': end_ts=3542398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' encourage': end_ts=3547518, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' everyone': end_ts=3554238, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' to': end_ts=3559038, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' read': end_ts=3562878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' the': end_ts=3565118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' safety': end_ts=3570238, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' information': end_ts=3574078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' leaflet': end_ts=3584638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' located': end_ts=3592958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' in': end_ts=3598078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' the': end_ts=3599358, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' seat': end_ts=3602237, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' back': end_ts=3605118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pocket.': end_ts=3612158, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Merci': end_ts=3630398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pour': end_ts=3634878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' votre': end_ts=3637758, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' attention.': end_ts=3643198, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Nous': end_ts=3655678, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' vous': end_ts=3657918, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' souhaitons': end_ts=3664638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' un': end_ts=3667198, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' bon': end_ts=3670078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' vol.': end_ts=3675518, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Thank': end_ts=3691838, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' you': end_ts=3694398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' for': end_ts=3696958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' your': end_ts=3698878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' attention.': end_ts=3703998, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' We': end_ts=3717118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' wish': end_ts=3720638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' you': end_ts=3722878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' a': end_ts=3724478, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' very': end_ts=3728958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pleasant': end_ts=3735038, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' flight.': end_ts=3740158, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][verbatim.py:610][process_audio_window] [221.119875/14.3320625/13.2780625][00:03:41-00:03:41][en] We encourage
[00:03:41-00:03:53] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.
[00:03:41-00:03:41][SPEAKER_01][en] We encourage
[00:03:41-00:03:53][None] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.
[00:03:41-00:03:53][SPEAKER_01] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.

The last three segments are merged into one and all assigned to the English speaker.
Do you have any idea about my last message @gaspardpetit? Otherwise, we can close this as completed :)
I believe this is an issue with the end-of-transcript logic. Throughout the transcription, verbatim adds audio little by little and keeps a history of transcriptions from previous attempts. When two attempts match perfectly, the words are confirmed. When enough words are confirmed to form a complete utterance (e.g. a sentence, or a logical fragment of one), the utterance is acknowledged and we advance the window.

At the very end, there are generally leftovers: words that have not yet been confirmed or acknowledged. There is separate logic to flush them, but this logic may not treat language/diarization correctly. This kind of issue should become more obvious when we start benchmarking against ground truths. I suggest we keep moving forward and re-open a bug when this is reproduced by the test framework/metrics you've been working on.
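One way the flush could avoid merging everything into a single utterance would be to split the leftover words at language boundaries. This is just a sketch of the idea, not what verbatim currently does (the real flush path works on richer word objects with timestamps and speaker labels):

```python
def flush_pending(words):
    """Split leftover (text, lang) word tuples into separate utterances
    at each language change, instead of merging them into one.

    Illustrative sketch only; `words` is a simplified stand-in for
    verbatim's word objects.
    """
    utterances, current = [], []
    for text, lang in words:
        # Start a new utterance whenever the language changes.
        if current and lang != current[-1][1]:
            utterances.append(current)
            current = []
        current.append((text, lang))
    if current:
        utterances.append(current)
    return utterances
```

Each resulting group could then be attributed to a speaker independently, which would avoid assigning the French segment to the English speaker.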