Example app VAD default + memory reduction #217

Merged: 11 commits, Oct 8, 2024

@ZachNagengast ZachNagengast commented Oct 8, 2024

This PR sets the example app and CLI to use VAD as the default setting. VAD uses a lot of memory for async predictions, so this also includes some general improvements to memory and thread handling. There is future work to improve this (see #209), but I'm including one of @keleftheriou's fixes here in 2770d84.

Memory issue detail

For very large files, there was a large memory spike as the audio was copied into a float array for the model to consume (peaks at 2 GB):
Screenshot 2024-10-06 at 6 03 34 PM

Now the audio is converted directly into a float array in chunks to mitigate this (never goes above 1 GB):
Screenshot 2024-10-06 at 6 05 20 PM

This makes processing the full file about 20% slower, which can surely be improved.
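
As a rough illustration of the idea (not the WhisperKit code, which is Swift), here is a minimal Python sketch of converting audio to floats chunk by chunk. It assumes the input is raw 16-bit little-endian mono PCM; `load_pcm16_as_floats` and `chunk_frames` are hypothetical names used only for this example. Only one small chunk of source samples is held alongside the growing float array, rather than the whole source buffer plus a full-size copy.

```python
import io
import struct
import sys
from array import array


def load_pcm16_as_floats(stream, chunk_frames=4096):
    """Read 16-bit little-endian mono PCM from `stream` in chunks and
    return the samples as a float32 array scaled to [-1.0, 1.0).

    Hypothetical helper for illustration; assumes raw PCM with no header.
    """
    floats = array("f")                 # destination grows chunk by chunk
    bytes_per_chunk = chunk_frames * 2  # 2 bytes per 16-bit sample
    while True:
        raw = stream.read(bytes_per_chunk)
        if not raw:
            break
        usable = len(raw) - (len(raw) % 2)  # ignore a trailing odd byte
        ints = array("h")
        ints.frombytes(raw[:usable])
        if sys.byteorder == "big":
            ints.byteswap()             # input is assumed little-endian
        # Convert only this chunk; the temporary is freed on the next pass.
        floats.extend(sample / 32768.0 for sample in ints)
    return floats


# Usage: three samples read with a deliberately tiny chunk size.
pcm = struct.pack("<3h", 0, 16384, -32768)
samples = load_pcm16_as_floats(io.BytesIO(pcm), chunk_frames=2)
print(list(samples))  # [0.0, 0.5, -1.0]
```

The trade-off matches what is described above: more, smaller conversions cost some speed but keep the temporary copy bounded by the chunk size.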

ZachNagengast and others added 6 commits October 6, 2024 17:00
- Reduces peak memory by doing the array conversion while loading in chunks so the array copy size is lower
- Previously copied the entire buffer which spiked the memory 2x
@ZachNagengast ZachNagengast requested a review from a2they October 8, 2024 02:08
- Optional cli commands are deprecated
- @_disfavoredOverload required @available to prevent infinite loop
Tests/WhisperKitTests/UnitTests.swift (outdated, resolved)
Sources/WhisperKit/Core/Audio/AudioProcessor.swift (outdated, resolved)
@ZachNagengast ZachNagengast merged commit e3e21d4 into main Oct 8, 2024
9 checks passed
kaiwen-wang commented Jan 25, 2025

Based on this, would it be better not to use VAD if we want quality? Since the model handles speech start/end, isn't the only benefit of VAD essentially not processing silent audio, and not splitting the text arbitrarily at periodic intervals?
