
feat: macOS ASR Support #79

Closed
gaspardpetit opened this issue Jan 7, 2025 · 10 comments · Fixed by #87 or #76
Labels
enhancement New feature or request

Comments

@gaspardpetit
Owner

No description provided.

@gaspardpetit gaspardpetit converted this from a draft issue Jan 7, 2025
@gaspardpetit gaspardpetit changed the title from "MacOS ASR" to "feat: macOS ASR Support" Jan 7, 2025
@gaspardpetit gaspardpetit added the enhancement New feature or request label Jan 7, 2025
@gaspardpetit gaspardpetit linked a pull request Jan 7, 2025 that will close this issue
@gaspardpetit
Owner Author

@linozen #87 should fix some of the issues you were seeing with pywhispercpp. After investigation, the main issues turned out to be the following:

  • Word timestamps were missing the window timestamp offset. As the 30-second window advances, window_ts tracks the window offset; words are stored in absolute time, so when we create them we add window_ts to the offset reported by whisper.cpp.
  • whisper.cpp does not prefix words with a space, as you had observed, so for now the space is added manually. I am not 100% sure what consequences this may have for groups of words that are not separated by a space (e.g. joined by a hyphen or an apostrophe)...
  • whisper.cpp seems to start sequences with an empty word; this is now filtered out.
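
The three fixes above could be sketched roughly like this (a minimal illustration; `Word`, `postprocess_words`, and the tuple shape of the raw words are hypothetical stand-ins, not verbatim's actual API):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start_ts: int  # absolute timestamps (window-relative offset + window_ts)
    end_ts: int

def postprocess_words(raw_words, window_ts):
    """Apply the three fixes: drop empty words, prefix a space where
    whisper.cpp omitted it, and shift window-relative timestamps by
    window_ts so they become absolute."""
    words = []
    for text, start, end in raw_words:
        if not text.strip():          # whisper.cpp may emit an empty leading word
            continue
        if not text.startswith(" "):  # whisper.cpp does not prefix words with a space
            text = " " + text
        words.append(Word(text=text, start_ts=start + window_ts, end_ts=end + window_ts))
    return words
```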

The alignment and timestamps of words are important. As we advance the window, we compare against the historical transcription; when we find a common prefix, it is acknowledged and we advance the window further.

It turns out that whisper.cpp's timestamps are... not great. When we advance the window, we also advance our historical transcriptions, and because the timestamps are inaccurate, the words become misaligned. As a result, once the window has advanced, it usually takes two attempts to get a match. By contrast, with faster_whisper or regular whisper, advancing the history and the window typically keeps words aligned, and a single iteration matches against the historical transcription. I am not sure I am making this clear, but essentially: it works, but it is not optimal.
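
The prefix-matching step described here could be sketched as follows (illustrative only, assuming words are compared as plain strings; verbatim's real logic also deals with timestamps):

```python
def confirm_common_prefix(previous, current):
    """Return the words confirmed by two successive transcription
    attempts over the same audio: the longest common prefix of the
    two word sequences. Words that differ remain unconfirmed."""
    confirmed = []
    for prev_word, cur_word in zip(previous, current):
        if prev_word != cur_word:
            break
        confirmed.append(cur_word)
    return confirmed
```

With inaccurate whisper.cpp timestamps, the history is shifted when the window advances, so this comparison often matches nothing on the first attempt after an advance, and only the second attempt confirms words.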

Another difference with whisper.cpp is that, from my tests, it seems to handle multi-language audio better than faster_whisper and openai/whisper. I am used to seeing the transcription stop at a language change, or start translating instead of transcribing. whisper.cpp does a much better job of transcribing the original language even when the wrong language has been requested, and of handling changes of language mid-window.

Let me know how it goes for you.

@github-project-automation github-project-automation bot moved this from In progress to Done in Verbatim Jan 7, 2025
@gaspardpetit gaspardpetit linked a pull request Jan 7, 2025 that will close this issue
@gaspardpetit gaspardpetit moved this from Done to In progress in Verbatim Jan 7, 2025
@gaspardpetit gaspardpetit reopened this Jan 7, 2025
@linozen
Collaborator

linozen commented Jan 7, 2025

Thanks for your work on this. #87 indeed fixes all of the most egregious issues, and your explanations help me understand the logic in the code. I see similar issues to the ones you outlined. I'll give it a shot and see what might be done to improve the accuracy of the timestamps.

The issue in whisper.cpp is known and tracked here:

What I would like to try in a separate PR is using https://github.com/ml-explore/mlx-examples/tree/main/whisper. It supports word-level timestamps, and maybe those work better with verbatim's alignment logic.

@gaspardpetit
Owner Author

Awesome, good luck - I read that HF transformers should also work: openai/whisper#984

I had attempted to use this interface earlier, and I think I ended up struggling with the underlying assumption that audio would be provided continuously, whereas verbatim repeats audio segments multiple times. The first transcription would work, but then you'd have to destroy the pipeline and re-create it, otherwise it would get confused by seeing repeated text. This was immediately observable with the Air France sample in the project, which starts with the same sentence repeated in French and English. The HF pipeline thought it was a repetition and ignored the English "welcome aboard ladies and gentlemen" completely.

I ended up switching to faster_whisper after that, but perhaps the HF transformer can be reset.

@linozen linozen mentioned this issue Jan 7, 2025
@linozen
Collaborator

linozen commented Jan 7, 2025

So, I added the PR. But there are still some outstanding problems with how this interacts with the alignment logic in verbatim.py. See the transcript below:

[00:00:00-00:00:02][en] Madame, Monsieur, bonjour et bienvenue à bord.
[00:00:03-00:00:04] Welcome aboard, ladies and gentlemen.
[00:00:06-00:00:10] For your safety and comfort, please take a moment to watch the following safety video.
[00:00:10-00:00:13] This film is about your safety on board.
[00:00:28-00:00:33] Whenever the seatbelt sign is on, your seatbelt must be securely fastened.
[00:00:34-00:00:39] For your safety, we recommend that you keep your seatbelt fastened and visible at all times while seated.
[00:00:42-00:00:44] To release the seatbelt, just lift the buckle.
  1. Some lines are translated (the 4th line should be French).
  2. There is content missing, e.g. 00:00:13-00:00:28, and the transcripts are often incomplete.

@linozen
Collaborator

linozen commented Jan 7, 2025

Performance is great though 👌

@gaspardpetit
Owner Author

Brilliant - I'll take a look tonight. This behaviour aligns with what I have seen before when whisper is configured with the wrong language (skipping text and translating instead of transcribing). The issue may just be with the language detection logic - I posted a comment in the review: I think for short durations we should return a low probability, so that verbatim retries with longer durations. I think it currently retries when the probability is below 0.5.
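
The retry idea suggested here could be sketched like this (a rough illustration; `guess_language`, the list of durations, and the 0.5 default are assumptions based on the discussion, not verbatim's actual code):

```python
def detect_language(guess_language, audio, start, durations, threshold=0.5):
    """Try language detection on progressively longer windows until the
    reported probability clears the threshold. If short windows report a
    low probability, this loop keeps retrying with more audio.

    guess_language: hypothetical callable taking an audio slice and
    returning (language, probability)."""
    lang, prob = None, 0.0
    for duration in durations:
        lang, prob = guess_language(audio[start:start + duration])
        if prob >= threshold:
            break
    return lang, prob
```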

@gaspardpetit
Owner Author

I tested and can confirm - the speed is quite good! I'm getting comparable speed on the Mac Studio to what I get on the RTX 4070. I pushed a couple of minor adjustments under #90.

Thanks again for this contribution!

@linozen
Collaborator

linozen commented Jan 8, 2025

Great! This is shaping up to be a great cross-platform package for transcription/diarization.

One issue I see, but only at the end of the AirFrance clip when diarising:

2025-01-08T11:46:04Z [INFO][whispermlx.py:54][guess_language] Detected language: en
2025-01-08T11:46:04Z [INFO][whispermlx.py:77][transcribe] Transcribing audio window: window_ts=3537918, audio_ts=3754847
[00:00.000 --> 00:04.640]  We encourage everyone to read the safety information leaflet located in the seat back pocket.
[00:05.220 --> 00:08.600]  Merci pour votre attention. Nous vous souhaitons un bon vol.
[00:09.060 --> 00:12.640]  Thank you for your attention. We wish you a very pleasant flight.
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' We': end_ts=3542398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' encourage': end_ts=3547518, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' everyone': end_ts=3554238, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' to': end_ts=3559038, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' read': end_ts=3562878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' the': end_ts=3565118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' safety': end_ts=3570238, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' information': end_ts=3574078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' leaflet': end_ts=3584638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' located': end_ts=3592958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' in': end_ts=3598078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' the': end_ts=3599358, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' seat': end_ts=3602237, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' back': end_ts=3605118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pocket.': end_ts=3612158, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Merci': end_ts=3630398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pour': end_ts=3634878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' votre': end_ts=3637758, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' attention.': end_ts=3643198, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Nous': end_ts=3655678, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' vous': end_ts=3657918, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' souhaitons': end_ts=3664638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' un': end_ts=3667198, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' bon': end_ts=3670078, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' vol.': end_ts=3675518, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' Thank': end_ts=3691838, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' you': end_ts=3694398, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' for': end_ts=3696958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' your': end_ts=3698878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' attention.': end_ts=3703998, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' We': end_ts=3717118, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' wish': end_ts=3720638, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' you': end_ts=3722878, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' a': end_ts=3724478, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' very': end_ts=3728958, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' pleasant': end_ts=3735038, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][whispermlx.py:144][transcribe] Word ' flight.': end_ts=3740158, audio_ts=3754847
2025-01-08T11:46:06Z [INFO][verbatim.py:610][process_audio_window] [221.119875/14.3320625/13.2780625][00:03:41-00:03:41][en] We encourage
[00:03:41-00:03:53] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.


[00:03:41-00:03:41][SPEAKER_01][en] We encourage
[00:03:41-00:03:53][None] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.
[00:03:41-00:03:53][SPEAKER_01] everyone to read the safety information leaflet located in the seat back pocket. Merci pour votre attention. Nous vous souhaitons un bon vol. Thank you for your attention. We wish you a very pleasant flight.

The last three segments are merged into one and assigned entirely to the English speaker SPEAKER_01. Do you have an intuition as to why that might be? As you can see, the segments were correctly separated by mlx_whisper.

@linozen
Collaborator

linozen commented Jan 13, 2025

Do you have any idea about my last message, @gaspardpetit? Otherwise, we can close this as completed :)

@gaspardpetit
Owner Author

I believe this is an issue with the end-of-transcript logic. Throughout the transcription, verbatim adds audio little by little and keeps a history of transcriptions from previous attempts. When two attempts match perfectly, the words are confirmed. When enough confirmed words form a complete utterance (e.g. a sentence, or a logical fragment of one), the utterance is acknowledged and we advance the window. At the very end, there are generally leftovers - words that have not yet been confirmed or acknowledged. There is separate logic to flush them, but this logic may not handle language/diarization correctly.
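
A fix for the suspected flush bug could look something like this sketch: instead of merging all leftover words into one segment, split them into utterances whenever the language or speaker changes (the word tuples and function name are hypothetical, not verbatim's actual data model):

```python
def flush_leftovers(leftover_words):
    """Flush end-of-transcript leftovers as separate utterances, starting
    a new utterance whenever the language or speaker changes instead of
    merging everything into one segment.

    leftover_words: list of (text, language, speaker) tuples."""
    utterances = []
    current = []
    for word in leftover_words:
        if current and (word[1] != current[-1][1] or word[2] != current[-1][2]):
            utterances.append(current)  # language or speaker changed: close utterance
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    return utterances
```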

This kind of issue should become more obvious once we start benchmarking against ground truths - I suggest we keep moving forward and re-open a bug when this is reproduced by the test framework/metrics you've been working on.
