VAD audio chunking #135
Great work @jkrukowski! Did you see the Cut and Merge strategy in https://github.com/m-bain/whisperX? If we don't attempt to pack short segments into 30 seconds like above before padding, the worst-case performance might regress below baseline (e.g. by frequently padding ~1-3s chunks to 30s). Let me know if you think Cut and Merge is an extension we should leave as future work or bundle here :) Edit: Cut and Merge will also mean some additional bookkeeping to adjust word-level timestamps post-inference.
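To make the packing idea concrete, here is a minimal sketch of a greedy merge step. It assumes the speech segments are sorted by start time and merges neighbors while the combined span stays within a maximum duration. The function name `packSegments` and the tuple layout are hypothetical, and whisperX's actual Cut and Merge may differ in its details.

```swift
/// Hypothetical sketch of greedy segment packing (not whisperX's or WhisperKit's code).
/// Assumes `segments` is sorted by start time; times are in seconds.
func packSegments(
    _ segments: [(start: Double, end: Double)],
    maxDuration: Double
) -> [(start: Double, end: Double)] {
    var merged: [(start: Double, end: Double)] = []
    for segment in segments {
        if let last = merged.last, segment.end - last.start <= maxDuration {
            // The segment still fits into the current group: extend the group.
            merged[merged.count - 1].end = segment.end
        } else {
            // Otherwise start a new group with this segment.
            merged.append(segment)
        }
    }
    return merged
}

let packed = packSegments(
    [(start: 0, end: 2), (start: 3, end: 5), (start: 28, end: 29), (start: 40, end: 42)],
    maxDuration: 30
)
```

With this input the first three segments end up in one group and the last one starts a new group, so far fewer short chunks would need padding.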
@Abhinay1997 Do you mind rebasing on top of this PR so we can add a WER check (w/ and w/o VAD-based chunking) on your long audio test sample? 🙏
Hey @atiorh! No worries, I'll do that by tomorrow. I want to make sure there are no bugs or crash-prone code in my PR.
I'd leave it as future work if possible. After talking to @ZachNagengast the other day I took a slightly different approach here: using VAD, I try to find the best cut-off point in the second half of each 30-second audio chunk. So there is no risk of ending up with a bunch of small segments padded with zeros, because each segment will contain at least 15 seconds of the original audio. Having said that, I think that Cut and Merge is a better (but more complicated) approach.
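A rough sketch of that cut-off search, for illustration only. It assumes per-frame voice activity flags are already available and that "best" means the middle of the longest silent run in the second half of the window; both the criterion and the name `findCutoff` are assumptions, not the PR's actual code.

```swift
/// Hypothetical sketch: returns a cut-off sample index in the second half of a window.
/// `voiceActivity[i]` is true when frame `i` contains speech.
func findCutoff(voiceActivity: [Bool], frameLength: Int, windowLength: Int) -> Int {
    let frameCount = min(voiceActivity.count, windowLength / frameLength)
    let firstFrame = frameCount / 2 // only search the second half of the window
    var bestStart = -1
    var bestLength = 0
    var runStart = -1
    for i in firstFrame..<frameCount {
        if !voiceActivity[i] {
            if runStart < 0 { runStart = i }
            let length = i - runStart + 1
            if length > bestLength {
                bestLength = length
                bestStart = runStart
            }
        } else {
            runStart = -1
        }
    }
    // No silence found: fall back to cutting at the end of the window.
    guard bestLength > 0 else { return windowLength }
    // Cut in the middle of the longest silent run.
    return (bestStart + bestLength / 2) * frameLength
}

let activity = [true, true, false, false, true, true, false, false, false, true]
let cutoff = findCutoff(voiceActivity: activity, frameLength: 100, windowLength: 1000)
let fallback = findCutoff(voiceActivity: [true, true, true, true], frameLength: 100, windowLength: 400)
```

Because the search starts at the midpoint, the returned index is never smaller than half the window, which is what guarantees at least 15 seconds of original audio per 30-second chunk.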
Makes sense, this is great.
Looks really nice 👍 Just a couple of suggestions for clarity and future adaptability.
        self.energyThreshold = energyThreshold
    }

    func voiceActivity(in waveform: [Float]) -> [Bool] {
It would be nice to have a public helper method that returns the exact clip timestamps, in case someone wants to run the chunking ahead of time in their app and pass them directly via the clipTimestamps decoding option, e.g.
let clips: [Int] = EnergyVAD().voiceActivity(in: audioArray)
let options = DecodingOptions(clipTimestamps: clips)
Added a calculateNonSilentChunks method in AudioProcessor, which is backed by EnergyVAD. This way we can keep the EnergyVAD class internal.
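For readers unfamiliar with energy-based VAD, the idea can be sketched as follows: compute the RMS energy of fixed-size frames, mark frames above a threshold as speech, then convert runs of active frames into sample ranges. The names `frameActivity` and `nonSilentRanges`, the RMS criterion, and the parameters are assumptions for illustration; they are not the signatures of `EnergyVAD` or `calculateNonSilentChunks`.

```swift
/// Hypothetical sketch: marks each frame as speech when its RMS energy exceeds a threshold.
func frameActivity(in waveform: [Float], frameLength: Int, energyThreshold: Float) -> [Bool] {
    guard frameLength > 0 else { return [] }
    var result: [Bool] = []
    var start = 0
    while start < waveform.count {
        let end = min(start + frameLength, waveform.count)
        var sum: Float = 0
        for sample in waveform[start..<end] { sum += sample * sample }
        let rms = (sum / Float(end - start)).squareRoot()
        result.append(rms > energyThreshold)
        start = end
    }
    return result
}

/// Hypothetical sketch: converts runs of active frames into [start, end) sample ranges.
func nonSilentRanges(activity: [Bool], frameLength: Int, sampleCount: Int) -> [(start: Int, end: Int)] {
    var ranges: [(start: Int, end: Int)] = []
    var runStart: Int? = nil
    for (index, isActive) in activity.enumerated() {
        if isActive {
            if runStart == nil { runStart = index }
        } else if let first = runStart {
            ranges.append((start: first * frameLength, end: index * frameLength))
            runStart = nil
        }
    }
    // Close a run that extends to the end of the audio.
    if let first = runStart {
        ranges.append((start: first * frameLength, end: min(activity.count * frameLength, sampleCount)))
    }
    return ranges
}

let samples: [Float] = [0, 0, 1, 1, 1, 1, 0, 0]
let flags = frameActivity(in: samples, frameLength: 2, energyThreshold: 0.5)
let ranges = nonSilentRanges(activity: flags, frameLength: 2, sampleCount: samples.count)
```

Keeping the detector and the range conversion separate mirrors the split described above, where the public entry point lives on the audio processor and the detector itself can stay internal.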
import Accelerate

/// Voice activity detection based on energy threshold
final class EnergyVAD {
We have some other VAD code in the repo; could you consolidate that code by using this new class? We may need a protocol for the other VAD methods we have coming up (mel analysis, ML-based), but it's your call on API design. I think the enum and existing chunking protocol solve this pretty well too, fwiw.
For now I decided to move this other VAD code in our repo to the isVoiceDetected method in AudioProcessor. This way we can keep EnergyVAD internal until the full public interface of this class is ready.
Support chunking VAD for paths
Support clip timestamps with vad
# Conflicts: # Tests/WhisperKitTests/RegressionTests.swift
@@ -506,6 +585,22 @@ public func mergeTranscriptionResults(_ results: [TranscriptionResult?], confirm
    )
}

public func updateSegmentTimings(segment: TranscriptionSegment, seekTime: Float) -> TranscriptionSegment {
Suggested change:
- public func updateSegmentTimings(segment: TranscriptionSegment, seekTime: Float) -> TranscriptionSegment {
+ @available(macOS 13, iOS 16, watchOS 10, visionOS 1, *)
+ public func updateSegmentTimings(segment: TranscriptionSegment, seekTime: Float) -> TranscriptionSegment {
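As context for why a timing update is needed with chunking: each chunk is transcribed with timestamps relative to its own start, so they have to be shifted by the chunk's offset in the full audio. The sketch below shows only that idea; `Segment` and `shifted` are hypothetical stand-ins, and the real `updateSegmentTimings` operates on `TranscriptionSegment` and may also adjust word-level timings.

```swift
/// Hypothetical stand-in for a transcription segment; times are in seconds.
struct Segment {
    var start: Float
    var end: Float
    var text: String
}

/// Hypothetical sketch: shifts chunk-relative timings by the chunk's seek time.
func shifted(_ segment: Segment, by seekTime: Float) -> Segment {
    var result = segment
    result.start += seekTime
    result.end += seekTime
    return result
}

// A segment found 1.5s into a chunk that begins 30s into the recording.
let local = Segment(start: 1.5, end: 4.0, text: "hello")
let global = shifted(local, by: 30)
```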
Here's a recording of the example app running chunking about 4x faster with minimal WER loss 🚀
vad.chunking.example.mp4
Approved and good to go. Excellent work on this @jkrukowski, this will be a massive improvement across the board for large workloads, and it even helps with streaming if the buffer gets backed up by >30s.
This PR introduces audio chunking with VAD. The VAD is used to detect speech segments in the audio file, and the audio is then split into chunks based on the detected speech segments (each padded with zeros to match the 30-second length). The chunks are then processed in a batch, resulting in a significant speedup.
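The padding step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the PR's code: the helper name `padded` is hypothetical, and the window size assumes Whisper's 16 kHz input, so 30 seconds is 480,000 samples.

```swift
/// Hypothetical sketch: pads a chunk with zeros (or truncates it) to a fixed length.
func padded(_ chunk: [Float], to length: Int) -> [Float] {
    if chunk.count >= length {
        return Array(chunk.prefix(length))
    }
    return chunk + [Float](repeating: 0, count: length - chunk.count)
}

// 30 seconds of audio at a 16 kHz sample rate.
let windowSamples = 30 * 16_000

let shortChunk: [Float] = [0.1, 0.2, 0.3]
let small = padded(shortChunk, to: 5)
let full = padded(shortChunk, to: windowSamples)
```

Equal-length chunks are what make batching possible, since every element of the batch then has the same input shape.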
Some benchmarks (on my MacBook Air M1):
Audio file 12:16 length
Audio file 40:26 length
To use it in WhisperKitCLI the user has to pass the chunking-strategy flag: