Making Sense of TranscriptionSegment IDs #258
Comments
Good questions. The id is only relevant within a single transcription run, so in streaming mode it resets to 0 every time. However, the segments all arrive sequentially in that mode, so you'd be best off tracking them outside of the library. You could extend the result object to hold the latest seek time and sort on that, but you'll need a secondary sort for transcriptions that have multiple segments, where the id will be valid. Otherwise, in our example app we just store them all as they get confirmed, using the end timestamp value and ignoring the id value; you can see that logic here
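As a rough sketch of the seek-then-id ordering mentioned above: the type 'TrackedSegment' and its 'runSeek' field are hypothetical stand-ins recorded by the caller, not WhisperKit's own types, and the seek values are made up for illustration.

```swift
import Foundation

// Hypothetical stand-in for the fields discussed here; not WhisperKit's own type.
struct TrackedSegment: Equatable {
    let runSeek: Int   // seek position the caller recorded for the run (assumption)
    let id: Int        // segment id, only meaningful within a single run
    let text: String
}

// Order by the run's seek position first, then by id within that run.
func ordered(_ segments: [TrackedSegment]) -> [TrackedSegment] {
    segments.sorted { lhs, rhs in
        if lhs.runSeek != rhs.runSeek { return lhs.runSeek < rhs.runSeek }
        return lhs.id < rhs.id
    }
}

// Segments from two runs, arriving out of order.
let mixed = [
    TrackedSegment(runSeek: 480_000, id: 0, text: "third"),
    TrackedSegment(runSeek: 0, id: 1, text: "second"),
    TrackedSegment(runSeek: 0, id: 0, text: "first"),
]
let orderedTexts = ordered(mixed).map(\.text)
```

The id only breaks ties inside one run, so two runs that both start at id 0 no longer collide.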
Thank you for the explanation Zach. Having the index reset each time a new stream transcription run starts makes sense, and having them come in sequentially is how I have seen them execute, except for some edge cases where it will jump from index 0 to index 4 and then back to index 0, then continue to index 1, etc. I'm not too sure why that is occurring and I haven't found any reliable steps to reproduce it.
I did see the 'lastConfirmedSegment' approach in the example app; looking at it more closely now, using the timestamps makes sense. It was unclear to me, though, why 'requiredSegmentsForConfirmation' is 4 by default. Assuming that each segment is 30s, that would mean you wouldn't get a confirmed segment until 2m in. This is not the case, as I saw the example app confirm segments (via a transition from light grey -> black) much earlier, so the confirmed segments must be much shorter than the 30s window.
I did have a few other observations and queries. I'm not sure if this is the best place to discuss them as they differ from the index issue I mentioned above, but I'll quickly cover them anyway. When I initialize WhisperKit I do it as follows:
This happens as soon as the app launches so that the model is ready to go as soon as possible. When I am connected to the internet this works fine: it will load the model from disk if present, and if not it will download the model from Hugging Face. However, if I am not connected to Wi-Fi or cellular and the model has already been downloaded to disk, it will not initialize. If the device is offline it needs to be initialized as follows:
This is how I fetch the local model folder:
Given this is the case, the logic I have implemented in my app uses the Reachability library. Essentially it checks whether the device is connected to the internet via Reachability; if so it initializes WhisperKit via the online method, and if not it initializes via the offline method. Perhaps I am misusing the library or misunderstanding it, but is there a better way to do this?
I appreciate the new Model State Callback, it's super helpful for issuing UI updates based on what the model is doing behind the scenes. One issue I have come across is that the callback is not active until WhisperKit is actually initialized, though during init WhisperKit can change a model from 'downloading' -> 'downloaded' -> 'prewarming', etc. My example would be as follows:
I wonder if there is a way to pass the Model State Callback into the config or the initializer so that it can be used immediately?
One last thing, I'm curious as to why the 'medium' and 'medium.en' models are not included in the WhisperKit Hugging Face repo? I get that I can create my own, but the default option is fantastic and it'd be nice to have these as options if possible. Thanks Zach!
This is a heuristic we came up with that seemed reasonable, but it is adjustable for your use case. Normally a 30s chunk will have more than 1 segment, but in some cases (usually at the start) it will do the entire 30s in one segment. This is actually a to-do to fix, since our current pipeline has some kind of bug that doesn't find as many timestamp tokens in the first window, but finds them more often in subsequent windows. The goal is just to make sure that the input audio is not partially transcribed with hallucinations at the end (Whisper tends to repeat tokens at the very end of a partial window), but the unconfirmed segments tend to have decent results anyway.
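A minimal sketch of that confirmation heuristic, assuming a hypothetical 'TimedSegment' type and a caller-owned 'StreamState'; only the name 'requiredSegmentsForConfirmation' and its default of 4 come from this thread, and the example app's actual logic may differ in its details.

```swift
import Foundation

// Hypothetical stand-in holding only the fields this sketch needs.
struct TimedSegment: Equatable {
    let start: Double
    let end: Double
    let text: String
}

// Caller-owned state (assumption): what has been confirmed so far.
struct StreamState {
    var lastConfirmedEnd: Double = 0
    var confirmed: [TimedSegment] = []
    var unconfirmed: [TimedSegment] = []
}

// Hold back the trailing `required` segments as unconfirmed and confirm the
// earlier ones, using the end timestamp (not the id) to skip duplicates.
func apply(_ segments: [TimedSegment],
           to state: inout StreamState,
           requiredSegmentsForConfirmation required: Int = 4) {
    guard segments.count > required else {
        state.unconfirmed = segments
        return
    }
    let confirmCount = segments.count - required
    for segment in segments.prefix(confirmCount) where segment.end > state.lastConfirmedEnd {
        state.confirmed.append(segment)
        state.lastConfirmedEnd = segment.end
    }
    state.unconfirmed = Array(segments.suffix(required))
}

// Six one-second segments: the first two are confirmed, the last four held back.
var streamState = StreamState()
let sample = (0..<6).map {
    TimedSegment(start: Double($0), end: Double($0 + 1), text: "s\($0)")
}
apply(sample, to: &streamState)
```

Because confirmation depends on the segment count rather than elapsed time, short segments can be confirmed well before a full 30s window has passed.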
This may be a regression, but we do have an init config parameter
This is a good callout, we also want to do this for the recent callbacks added with #240. #help-wanted
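The shape being asked for could look roughly like the sketch below, where the callback is handed to the initializer so that transitions during init are reported too. 'ModelLoader' and 'LoaderState' are hypothetical names for illustration only; this is not WhisperKit's API, and the state list just mirrors the ones named in this thread.

```swift
import Foundation

// Hypothetical states and loader, named for illustration; not WhisperKit's API.
enum LoaderState: Equatable { case downloading, downloaded, prewarming, loaded }

final class ModelLoader {
    typealias StateCallback = (_ old: LoaderState?, _ new: LoaderState) -> Void

    private let onStateChange: StateCallback?
    private(set) var state: LoaderState?

    // Taking the callback here means it is already set when init does its work.
    init(onStateChange: StateCallback? = nil) {
        self.onStateChange = onStateChange
        // Simulated setup work: each step reports its transition.
        for next in [LoaderState.downloading, .downloaded, .prewarming, .loaded] {
            transition(to: next)
        }
    }

    private func transition(to new: LoaderState) {
        let old = state
        state = new
        onStateChange?(old, new)
    }
}

// Usage: every transition made during init reaches the observer.
var seenStates: [LoaderState] = []
let loader = ModelLoader { _, new in seenStates.append(new) }
```

Assigning the callback after construction would miss all four of these transitions, which is the gap described above.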
These models are unfortunately not good candidates for CoreML, but small is pretty close with regard to WER, and the recent turbo v20241018 is a decent alternative as well.
To expand on this: ANECompiler generates an incorrect program for
Hello,
Thank you very much for all of your work on this project, it's fantastic!
I've been building something around it and exploring how it works and had a question that I'm hoping you can help with please. In my context I am using Stream Only, essentially taking live audio from a microphone source and having it transcribe in real time.
When I call 'transcribeAudioSamples' based on the example project I get a TranscriptionResult? as a return. My current logic looks at TranscriptionResult -> 'segments', which is an array: [TranscriptionSegment]. I assume that the last segment is the most recent and use it to update my UI.
In the example app it uses a ForEach to enumerate the 'confirmedSegments' and draws them. The problem I find with this approach is that SwiftUI re-draws the 'Text' each time, so when the UI updates, it gets replaced, which doesn't feel fluent when dealing with longer lines of text.
The approach I have taken in my implementation is that I store a reference to TranscriptionSegment -> id, which is essentially an index. Then when the TranscriptionResult is updated it refers to the previous segment via its id and updates the old text with the new text in the UI. This assumes that the newest text is always the most accurate. I have found this to be correct most of the time; the initial results are a bit rough and then they are refined shortly after.
I assumed that the ID would increment sequentially. This is usually the case but not always. Given Whisper works in 30s chunks, this seems to be when the segment switches over. So 0s - 30s will be id: 0, then 30s - 60s will be id: 1, etc.
Though sometimes when feeding it a lot of words/talking in a shorter time period, it will jump up to 'id: 1' at say the 17s mark rather than 30s, then it will jump back to id: 0. Other times I've seen it jump from id: 0 -> id: 4 then back down to id: 1; this is in terms of what is being returned from WhisperKit as a TranscriptionResult?. Any idea of what is going on here?
When it jumps around like this the text is scattered all over the place. Sometimes the start of a spoken sentence is in id: 0, then the middle of the sentence goes up to the segment with id: 1, then the tail end of the sentence is back in id: 0. When it hits the 30s mark it seems to revert to normal.
Any thoughts you can offer would be kindly appreciated.
Thanks!