VAD audio chunking #135

jkrukowski · 2024-05-06T16:18:24Z

This PR introduces audio chunking with VAD (voice activity detection). VAD is used to detect speech segments in the audio file; the audio is then split into chunks based on those segments, and each chunk is padded with zeros to match the 30-second window length. The chunks are then processed in a batch, resulting in a significant speedup.
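
A minimal sketch of the zero-padding step, assuming 16 kHz mono audio (the sample rate Whisper models expect), so that a 30-second window holds 480,000 samples. The names `defaultWindowSamples` and `padToWindow` are hypothetical and are not WhisperKit API.

```swift
// Hypothetical helper, not part of WhisperKit: pads a speech chunk with
// trailing zeros so that it fills one 30-second window.
let sampleRate = 16_000                        // assumption: 16 kHz mono audio
let defaultWindowSamples = 30 * sampleRate     // 480_000 samples per window

func padToWindow(_ chunk: [Float], windowSamples: Int = defaultWindowSamples) -> [Float] {
    // Chunks that already fill (or exceed) the window are returned unchanged;
    // the chunker is expected to produce chunks that fit.
    guard chunk.count < windowSamples else { return chunk }
    return chunk + [Float](repeating: 0, count: windowSamples - chunk.count)
}

// A 5-second chunk is padded up to the full window.
let padded = padToWindow([Float](repeating: 0.1, count: 5 * sampleRate))
print(padded.count) // 480000
```

Padding every chunk to the same length is what makes it possible to stack the chunks into one batch.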

Some benchmarks (on my MacBook Air M1):

Audio file 12:16 length

with VAD:

38.16s user 5.86s system 470% cpu 9.349 total

without VAD:

33.25s user 3.55s system 132% cpu 27.678 total

Audio file 40:26 length

with VAD:

126.54s user 18.41s system 500% cpu 28.952 total

without VAD:

96.55s user 10.47s system 133% cpu 1:20.08 total
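
For reference, the wall-clock speedup implied by the `total` figures above can be worked out as follows. The numbers are copied from the timings; the `speedup` helper is only illustrative.

```swift
// Wall-clock ("total") times in seconds, copied from the timings above.
// 1:20.08 is written as 80.08 seconds.
func speedup(withoutVAD: Double, withVAD: Double) -> Double {
    return withoutVAD / withVAD
}

let shortFile = speedup(withoutVAD: 27.678, withVAD: 9.349)   // roughly 2.96x
let longFile = speedup(withoutVAD: 80.08, withVAD: 28.952)    // roughly 2.77x
print(shortFile, longFile)
```

Note that user CPU time goes up with VAD (38.16s vs 33.25s for the shorter file) while CPU utilization rises from about 130% to about 470-500%, so the gain comes from processing chunks in parallel rather than from doing less work.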

To use it in WhisperKitCLI, pass the --chunking-strategy flag:

swift run -c release whisperkit-cli transcribe --audio-path /path/to/audio.wav --chunking-strategy vad

atiorh · 2024-05-06T17:17:28Z

Great work @jkrukowski! Did you see the Cut and Merge strategy in https://github.com/m-bain/whisperX?

If we don't attempt to pack short segments into 30 seconds like above before padding, the worst case performance might regress below baseline (e.g. padding ~1-3s chunks to 30s a lot). Let me know if you think Cut and Merge is an extension we should leave as future work or bundle here :)

Edit: Cut and Merge will also mean some additional bookkeeping to adjust word-level timestamps post-inference.

atiorh · 2024-05-06T17:43:20Z

@Abhinay1997 Do you mind rebasing on top of this PR so we can add a WER check (w/ and w/o VAD-based chunking) on your long audio test sample? 🙏

Abhinay1997 · 2024-05-06T17:45:30Z

Hey @atiorh! No worries, I'll do that by tomorrow. I want to make sure there is no buggy or crash-prone code in my PR.

jkrukowski · 2024-05-07T07:50:36Z

> Great work @jkrukowski! Did you see the Cut and Merge strategy in https://github.com/m-bain/whisperX?
>
> If we don't attempt to pack short segments into 30 seconds like above before padding, the worst case performance might regress below baseline (e.g. padding ~1-3s chunks to 30s a lot). Let me know if you think Cut and Merge is an extension we should leave as future work or bundle here :)
>
> Edit: Cut and Merge will also mean some additional bookkeeping to adjust word-level timestamps post-inference.

I'd leave it as future work if possible. After talking to @ZachNagengast the other day I took a slightly different approach here: using VAD, I try to find the best cut-off point in the second half of each 30-second audio chunk. So there is no risk of ending up with a bunch of small segments padded with zeros, because each segment will contain at least 15 seconds of the original audio. Having said that, I think Cut and Merge is a better (but more complicated) approach.
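
One way to picture the cut-off search is sketched below. The rule used here (cut in the middle of the longest silent run in the second half of the window) is an assumption about what "best cut-off point" means, not the PR's actual implementation, and `cutOffFrame` is a hypothetical name.

```swift
// Sketch only: `activity[i]` is true when frame i of a 30-second window
// contains speech. Returns the frame index at which to cut the window.
func cutOffFrame(activity: [Bool]) -> Int {
    let half = activity.count / 2
    var bestStart = 0, bestLength = 0      // longest silent run found so far
    var runStart = half, runLength = 0     // silent run currently being scanned

    // Only the second half is searched, so the cut never falls before `half`.
    for i in half..<activity.count {
        if activity[i] {
            runLength = 0                  // speech ends the current silent run
        } else {
            if runLength == 0 { runStart = i }
            runLength += 1
            if runLength > bestLength {
                bestStart = runStart
                bestLength = runLength
            }
        }
    }
    // No silence in the second half: fall back to cutting at the window end.
    guard bestLength > 0 else { return activity.count }
    // Cut in the middle of the longest silent run.
    return bestStart + bestLength / 2
}

let frames = [true, true, false, true, true, false, false, false, true, true]
print(cutOffFrame(activity: frames)) // 6
```

Because the search starts at the midpoint, every chunk keeps at least half a window (15 seconds) of original audio, which is the property described above.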

atiorh · 2024-05-07T16:08:32Z

Makes sense, this is great.

ZachNagengast

Looks really nice 👍 just a couple suggestions for clarity and future adaptability.

ZachNagengast · 2024-05-07T16:25:16Z

Sources/WhisperKit/Core/EnergyVAD.swift

+        self.energyThreshold = energyThreshold
+    }
+
+    func voiceActivity(in waveform: [Float]) -> [Bool] {


It would be nice to have a public helper method that returns the exact clip timestamps, in case someone wants to run the chunking ahead of time in their app and pass the result directly via the clipTimestamps decoding option.

e.g.

let clips: [Int] = EnergyVAD().voiceActivity(in: audioArray)
let options = DecodingOptions(clipTimestamps: clips)

jkrukowski · 2024-05-07

Added a calculateNonSilentChunks method in AudioProcessor, which is backed by EnergyVAD. This way we can keep the EnergyVAD class internal.
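
To illustrate the idea behind an energy-threshold VAD and a helper that turns its output into non-silent chunks, here is a minimal sketch in plain Swift. It is not the PR's EnergyVAD or calculateNonSilentChunks code: the function names, the 1,600-sample frame length (0.1 s at 16 kHz) and the 0.02 threshold are all assumptions chosen for illustration.

```swift
// Sketch: marks each frame as speech when its RMS energy exceeds a threshold.
func energyVoiceActivity(in waveform: [Float], frameLength: Int = 1_600, energyThreshold: Float = 0.02) -> [Bool] {
    return stride(from: 0, to: waveform.count, by: frameLength).map { start -> Bool in
        let frame = waveform[start..<min(start + frameLength, waveform.count)]
        // Root-mean-square energy of the frame.
        let meanSquare = frame.reduce(Float(0)) { $0 + $1 * $1 } / Float(frame.count)
        return meanSquare.squareRoot() > energyThreshold
    }
}

// Sketch: converts per-frame flags into (start, end) sample ranges of speech.
// The end of the last range is frame-aligned, so it can exceed the waveform
// length when the final frame is partial.
func nonSilentChunks(activity: [Bool], frameLength: Int = 1_600) -> [(start: Int, end: Int)] {
    var chunks: [(start: Int, end: Int)] = []
    var current: Int? = nil                // start sample of the open chunk, if any
    for (index, isVoice) in activity.enumerated() {
        if isVoice, current == nil {
            current = index * frameLength  // speech begins
        } else if !isVoice, let start = current {
            chunks.append((start: start, end: index * frameLength))
            current = nil                  // speech ends
        }
    }
    if let start = current {
        chunks.append((start: start, end: activity.count * frameLength))
    }
    return chunks
}

let silence = [Float](repeating: 0, count: 1_600)
let tone = [Float](repeating: 0.5, count: 3_200)
let activity = energyVoiceActivity(in: silence + tone + silence)
let chunks = nonSilentChunks(activity: activity)
print(activity, chunks.count)
```

Dividing the sample offsets by the sample rate gives the chunk boundaries in seconds.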

ZachNagengast · 2024-05-07T16:28:05Z

Sources/WhisperKit/Core/EnergyVAD.swift

+import Accelerate
+
+/// Voice activity detection based on energy threshold
+final class EnergyVAD {


We have some other VAD code in the repo; could you consolidate it by using this new class? We may need a protocol for the other VAD methods we have coming up (mel analysis, ML-based), but it's your call on API design. I think the enum and the existing chunking protocol solve this pretty well too, fwiw.

jkrukowski · 2024-05-07

For now I decided to move the other VAD code in our repo to an isVoiceDetected method in AudioProcessor. This way we can keep EnergyVAD internal until the full public interface of this class is ready.

ZachNagengast · 2024-05-23T15:28:32Z

Sources/WhisperKit/Core/Utils.swift

@@ -506,6 +585,22 @@ public func mergeTranscriptionResults(_ results: [TranscriptionResult?], confirm
    )
 }

+public func updateSegmentTimings(segment: TranscriptionSegment, seekTime: Float) -> TranscriptionSegment {


Suggested change

public func updateSegmentTimings(segment: TranscriptionSegment, seekTime: Float) -> TranscriptionSegment {

@available(macOS 13, iOS 16, watchOS 10, visionOS 1, *)
public func updateSegmentTimings(segment: TranscriptionSegment, seekTime: Float) -> TranscriptionSegment {
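
The body of updateSegmentTimings is not shown in this thread. As a rough illustration of what adjusting segment timings by a seek offset could look like, here is a sketch under stated assumptions: `Segment` is a hypothetical stand-in for TranscriptionSegment, `shiftedSegment` is a hypothetical name, and the assumed behaviour is a plain shift of chunk-relative times by `seekTime` seconds.

```swift
// Hypothetical stand-in for WhisperKit's TranscriptionSegment; only the
// fields needed for the illustration are modelled.
struct Segment: Equatable {
    var start: Float   // seconds, relative to the start of the chunk
    var end: Float
    var text: String
}

// Assumed behaviour: shift chunk-relative times by the chunk's offset
// (seekTime, in seconds) so that they refer to the original audio.
func shiftedSegment(_ segment: Segment, seekTime: Float) -> Segment {
    var updated = segment
    updated.start += seekTime
    updated.end += seekTime
    return updated
}

let chunkRelative = Segment(start: 1.5, end: 4.0, text: "hello")
let absolute = shiftedSegment(chunkRelative, seekTime: 30)
print(absolute.start, absolute.end) // 31.5 34.0
```

This kind of bookkeeping is needed because each chunk is decoded independently, so its timestamps start from zero.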

ZachNagengast · 2024-05-23T16:43:09Z

Here's a recording of the example app running chunking about 4x faster with minimal WER loss 🚀

vad.chunking.example.mp4

ZachNagengast

Approved and good to go. Excellent work on this, @jkrukowski. This will be a massive improvement across the board for large workloads, and it even helps with streaming if the buffer gets backed up by more than 30s.

added audio chunker, added energy based vad, added tests

2d43a95

jkrukowski mentioned this pull request May 6, 2024

Audio chunking #125

Closed

jkrukowski added 2 commits May 6, 2024 18:24

fixed compilation

77fcfd0

fixed compilation

9170083

extracted prepareSeekClips function

57a5c3d

ZachNagengast requested changes May 7, 2024

review changes

5da2669

jkrukowski requested a review from ZachNagengast May 7, 2024 20:11

ZachNagengast and others added 7 commits May 15, 2024 23:12

Support chunking VAD for paths

65cb888

Merge branch 'main' into vad-chunking

23a7752

Updates from review

bbd07ce

Merge pull request #1 from argmaxinc/vad-chunking

4cdae1c

Support chunking VAD for paths

Support clip timestamps with vad

252f84a

Merge pull request #2 from argmaxinc/vad-chunking

2e0e7ba

Support clip timestamps with vad

Merge branch 'main' into vad-chunking

5c8b07e

# Conflicts: # Tests/WhisperKitTests/RegressionTests.swift

ZachNagengast reviewed May 23, 2024

jkrukowski and others added 3 commits May 23, 2024 17:31

fix compilation error

f930eda

PR review and cleanup

27f499b

Fix test normalization order

5632429

ZachNagengast added 2 commits May 23, 2024 11:07

UI and qol tweaks for example app

b847da6

Fix test normalization

524a10d

ZachNagengast self-requested a review May 23, 2024 18:21

ZachNagengast approved these changes May 23, 2024

Reduce accuracy requirement for vad chunker

8309b27

ZachNagengast added 2 commits May 23, 2024 11:45

Fix example app sidebar visibility

9144e2e

Further test normailziation fixes

3d46276

ZachNagengast merged commit 09aa70b into argmaxinc:main May 23, 2024
9 checks passed
