Releases: argmaxinc/WhisperKit
v0.6.1
Smaller patch release with some nice improvements and two new contributors 🙌
Highlights
- Tokenizer no longer requires a HubApi request to succeed if the files are already downloaded
- This was a big request from the community and should enable offline transcription as long as everything is downloaded already
- Also made the function public so you can bundle the tokenizer with the app along with the model files
- @smpanaro found a really nice speedup across the board by using IOSurface backed MLMultiArrays
- Especially noticeable on older devices
- General cleanup, including a nice bug fix from @couche1 when streaming via the CLI
What's Changed
- Memory and Latency Regression Tests by @Abhinay1997 in #99
- @Abhinay1997 is building out this regression test suite so we can be sure we're always shipping code that has the same or better speed, accuracy, memory, etc
- Fix audio file requirement for streaming mode by @couche1 in #121
- Use IOSurface-backed MLMultiArrays for float16 by @smpanaro in #130
- Cleanup by @ZachNagengast in #132
Full Changelog: v0.6.0...v0.6.1
v0.6.0
Highlights
- Async batch transcription is here 🎉 contributed by @jkrukowski
- With this release, you can now transcribe multiple audio files at once, fully utilizing the new async prediction APIs released with iOS 17/macOS 14 (see the WWDC video here).
- New interface with audioPaths input:
let audioPaths = ["/path/to/file1.wav", "/path/to/file2.wav"]
let whisperKit = try await WhisperKit()
let transcriptionResults: [[TranscriptionResult]?] = await whisperKit.transcribe(audioPaths: audioPaths)
- You can also use it via the CLI using the new argument --audio-folder "path/to/folder/"
- Future work will be chunking up single files to significantly speed up long-form transcription
- Note that this entails breaking changes and deprecations, see below for the full upgrade guide.
- Several bug fixes, accuracy improvements, and quality of life upgrades by @hewigovens @shawiz and @jkrukowski
- Every issue raised and PR merged from the community helps make WhisperKit better every release, thank you and keep them coming! 🙏
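The batch interface in the highlights above returns one optional entry per input file, so call sites need to decide what to do with files that produced no result. The following is a minimal sketch of that bookkeeping, not part of WhisperKit: `partitionBatch` is a hypothetical helper, and it assumes the results come back in the same order as the input paths, one slot per path.

```swift
import Foundation

/// Splits a batch into inputs that produced a result and inputs that did not.
/// `R` stands in for the per-file result type (for example `[TranscriptionResult]`).
func partitionBatch<R>(
    paths: [String],
    results: [R?]
) -> (succeeded: [(path: String, result: R)], failed: [String]) {
    // Assumption: the batch call returns exactly one slot per input, in input order.
    precondition(paths.count == results.count, "expected one result slot per path")
    var succeeded: [(path: String, result: R)] = []
    var failed: [String] = []
    for (path, result) in zip(paths, results) {
        if let result = result {
            succeeded.append((path: path, result: result))
        } else {
            failed.append(path)
        }
    }
    return (succeeded: succeeded, failed: failed)
}

// Usage with placeholder data; strings stand in for real transcription results.
let paths = ["/path/to/file1.wav", "/path/to/file2.wav"]
let placeholderResults: [[String]?] = [["hello world"], nil]
let outcome = partitionBatch(paths: paths, results: placeholderResults)
print("transcribed: \(outcome.succeeded.map { $0.path }), failed: \(outcome.failed)")
```

With a real batch call, the second argument would be the `[[TranscriptionResult]?]` array returned by `transcribe(audioPaths:)`.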
⚠️ Upgrade Guide
We aim to minimize breaking changes, so with this update we added a few deprecation flags for changed interfaces. The deprecated interfaces will be removed later, but for now they are still usable and will not cause build errors. There are some breaking changes for lower-level and newer methods, so if you do notice build errors, see the full upgrade guide below.
Full Upgrade Guide
API changes
Deprecations
WhisperKit
Deprecated
public func transcribe(
audioPath: String,
decodeOptions: DecodingOptions? = nil,
callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?
use instead
public func transcribe(
audioPath: String,
decodeOptions: DecodingOptions? = nil,
callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]
Deprecated
public func transcribe(
audioArray: [Float],
decodeOptions: DecodingOptions? = nil,
callback: TranscriptionCallback = nil
) async throws -> TranscriptionResult?
use instead
public func transcribe(
audioArray: [Float],
decodeOptions: DecodingOptions? = nil,
callback: TranscriptionCallback = nil
) async throws -> [TranscriptionResult]
TextDecoding
Deprecated
func decodeText(
from encoderOutput: MLMultiArray,
using decoderInputs: DecodingInputs,
sampler tokenSampler: TokenSampling,
options decoderOptions: DecodingOptions,
callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> [DecodingResult]
use instead
func decodeText(
from encoderOutput: MLMultiArray,
using decoderInputs: DecodingInputs,
sampler tokenSampler: TokenSampling,
options decoderOptions: DecodingOptions,
callback: ((TranscriptionProgress) -> Bool?)?
) async throws -> DecodingResult
Deprecated
func detectLanguage(
from encoderOutput: MLMultiArray,
using decoderInputs: DecodingInputs,
sampler tokenSampler: TokenSampling,
options: DecodingOptions,
temperature: FloatType
) async throws -> [DecodingResult]
use instead
func detectLanguage(
from encoderOutput: MLMultiArray,
using decoderInputs: DecodingInputs,
sampler tokenSampler: TokenSampling,
options: DecodingOptions,
temperature: FloatType
) async throws -> DecodingResult
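For the `transcribe` deprecations listed above, the main call-site change is going from a single optional result to an array. The sketch below shows two ways a call site might adapt. `StandInResult` is a hypothetical stand-in that models only a `text` field, and joining with a space is an assumption about how multiple results should be combined, not documented WhisperKit behavior.

```swift
import Foundation

// Hypothetical stand-in for a transcription result; only `text` is modeled here.
struct StandInResult {
    let text: String
}

/// Mirrors the deprecated shape: a single optional result.
func legacyShape(from results: [StandInResult]) -> StandInResult? {
    // Taking the first element restores the old "one result or nil" contract.
    return results.first
}

/// Uses the new shape directly by combining the text of every result.
func combinedText(from results: [StandInResult]) -> String {
    // Assumption: results are in playback order and can be joined with a space.
    return results.map { $0.text }.joined(separator: " ")
}

let newStyleResults = [StandInResult(text: "Hello"), StandInResult(text: "world")]
let firstOnly = legacyShape(from: newStyleResults)
let fullText = combinedText(from: newStyleResults)
print(firstOnly?.text ?? "", "|", fullText)
```

Because the deprecated and replacement methods share the same parameters, annotating the expected type at the call site (for example `let results: [TranscriptionResult] = ...`) makes it explicit which overload is intended.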
Breaking changes
- Removed the Transcriber protocol
AudioProcessing
static func loadAudio(fromPath audioFilePath: String) -> AVAudioPCMBuffer?
becomes
static func loadAudio(fromPath audioFilePath: String) throws -> AVAudioPCMBuffer
AudioStreamTranscriber
public init(
audioProcessor: any AudioProcessing,
transcriber: any Transcriber,
decodingOptions: DecodingOptions,
requiredSegmentsForConfirmation: Int = 2,
silenceThreshold: Float = 0.3,
compressionCheckWindow: Int = 20,
useVAD: Bool = true,
stateChangeCallback: AudioStreamTranscriberCallback?
)
becomes
public init(
audioEncoder: any AudioEncoding,
featureExtractor: any FeatureExtracting,
segmentSeeker: any SegmentSeeking,
textDecoder: any TextDecoding,
tokenizer: any WhisperTokenizer,
audioProcessor: any AudioProcessing,
decodingOptions: DecodingOptions,
requiredSegmentsForConfirmation: Int = 2,
silenceThreshold: Float = 0.3,
compressionCheckWindow: Int = 20,
useVAD: Bool = true,
stateChangeCallback: AudioStreamTranscriberCallback?
)
TextDecoding
func prepareDecoderInputs(withPrompt initialPrompt: [Int]) -> DecodingInputs?
becomes
func prepareDecoderInputs(withPrompt initialPrompt: [Int]) throws -> DecodingInputs
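Several of the breaking changes above replace an optional return value with a throwing function (`loadAudio(fromPath:)` and `prepareDecoderInputs(withPrompt:)`). The sketch below shows the two usual ways to adapt a call site. `loadSamples(fromPath:)` and `StandInAudioError` are hypothetical stand-ins with the same shape as the new signatures, not WhisperKit APIs.

```swift
import Foundation

// Hypothetical error type for the stand-in loader below.
enum StandInAudioError: Error {
    case unreadable(path: String)
}

/// Stand-in with the new shape: throws instead of returning nil.
func loadSamples(fromPath path: String) throws -> [Float] {
    guard path.hasSuffix(".wav") else {
        throw StandInAudioError.unreadable(path: path)
    }
    return [0.0, 0.1, -0.1] // placeholder samples
}

/// Option 1: handle the error explicitly, which keeps the failure reason.
func describeLoad(path: String) -> String {
    do {
        let samples = try loadSamples(fromPath: path)
        return "loaded \(samples.count) samples"
    } catch {
        return "failed: \(error)"
    }
}

// Option 2: `try?` converts the thrown error back into nil,
// which matches the old optional-returning behavior.
let legacyStyle: [Float]? = try? loadSamples(fromPath: "notes.txt")

print(describeLoad(path: "audio.wav"))
print(legacyStyle == nil ? "no samples" : "samples loaded")
```

Option 2 is the smaller change for existing code, at the cost of discarding the error that the new signatures were introduced to expose.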
What's Changed
- Add microphoneUnavailable error by @hewigovens in #113
- Improve token timestamps and language detection by @ZachNagengast in #114
- Respect skipSpecialTokens option in the decodingCallback function by @shawiz in #115
- Disallow invalid --language values by @jkrukowski in #116
- Run tests in parallel on CI by @jkrukowski in #117
- Async batch predictions by @jkrukowski in #107
New Contributors
- @hewigovens made their first contribution in #113
- @shawiz made their first contribution in #115
Full Changelog: v0.5.0...v0.6.0
v0.5.0
This is a HUGE release with some great new features and fixes 🙌
Highlights
- Timestamp logits filter by @jkrukowski
- Significantly improves the amount of timestamp tokens in a particular window, which helps a lot with segmentation
- This is on by default but can be disabled using the decoding option withoutTimestamps: true
- Language detection by @Abhinay1997
- New function on the TextDecoding protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
- Enabled by default for decoding options where usePrefillPrompt: false and language: nil, and the model is not English-only.
- First token log prob thresholds fallback check by @jkrukowski
- This feature is not in the original OpenAI implementation but helps reduce hallucinations quite a bit.
- Often, fallbacks due to the log prob threshold are immediately identifiable from the first token, so this reduces the number of forward passes needed before moving to a higher temperature
- Distil whisper support
- Recently distil-large-v3 was released which massively speeds up predictions at minimal quality loss. We've converted and optimized 4 distil models to use in WhisperKit on CoreML, they're really fast!
- distil-large-v3
- distil-large-v3_594MB
- distil-large-v3_turbo
- distil-large-v3_turbo_600MB
- Note that these do not yet have word timestamp alignment heads, so can't be used with wordTimestamps: true
- It can be run via CLI as well:
swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav
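The first token log prob fallback check from the highlights above can be illustrated with a small sketch. This is not WhisperKit's implementation: the function names, the temperature schedule, and the threshold value are all illustrative assumptions. The idea is that a very unlikely first token is treated as an early signal to retry at the next temperature, instead of decoding the whole window first.

```swift
import Foundation

/// Returns true when the first sampled token is unlikely enough to trigger a fallback.
/// Assumption: `threshold` is a negative log probability cutoff; nil disables the check.
func shouldFallbackEarly(firstTokenLogProb: Float, threshold: Float?) -> Bool {
    guard let threshold = threshold else { return false }
    return firstTokenLogProb < threshold
}

/// Walks a temperature schedule and returns the first temperature whose
/// first token passes the check, or the last temperature if none do.
func chooseTemperature(
    schedule: [Float],
    threshold: Float?,
    firstTokenLogProb: (Float) -> Float
) -> Float? {
    for temperature in schedule {
        let logProb = firstTokenLogProb(temperature)
        if shouldFallbackEarly(firstTokenLogProb: logProb, threshold: threshold) {
            continue // skip the rest of this window and retry at the next temperature
        }
        return temperature
    }
    return schedule.last
}

// Placeholder model: the first token is very unlikely at temperature 0.0 only.
let chosen = chooseTemperature(
    schedule: [0.0, 0.2, 0.4],
    threshold: -1.5,
    firstTokenLogProb: { temperature in temperature == 0.0 ? -2.3 : -0.4 }
)
print("decoding continues at temperature \(chosen ?? -1)")
```

In this placeholder run only one token is evaluated at temperature 0.0 before moving on, which is where the saving in forward passes comes from.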
⚠️ Experimental new stream mode
We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight or take a look at the code and let us know how it can be improved.
Recommended settings for the best performance for this iteration are:
- Max tokens per loop < 100
- Max fallback count < 2
- Prompt and cache prefill true
Looking for feedback on:
- Token confirmation numbers that work well
- Model, device, and settings combinations that work well
What's Changed
- CLI Task Handling in #85
- Added TimestampRulesFilter implementation by @jkrukowski in #45
- Support distil whisper models in #88
- Language Detection by @Abhinay1997 in #78
- Tokenizer refactor, tests cleanup by @jkrukowski in #87
- First token logProb thresholding by @jkrukowski in #90
- [#93] Add missing settings to decoding options by @cgfarmer4 in #94
- "Eager" streaming mode via word timestamps in #95
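The language detection feature listed above reads the language logits from a single forward pass and picks the most likely language. A minimal sketch of that selection step, together with the "enabled by default" condition from the highlights, might look like the following. The function and parameter names are illustrative assumptions, and the logits are placeholder values, not model output.

```swift
import Foundation

/// Mirrors the default condition described in the highlights:
/// no prefill prompt, no explicit language, and a multilingual model.
func shouldDetectLanguage(usePrefillPrompt: Bool, language: String?, isMultilingual: Bool) -> Bool {
    return !usePrefillPrompt && language == nil && isMultilingual
}

/// Picks the language with the highest logit and reports its softmax
/// probability, computed over the language tokens only.
func mostLikelyLanguage(from languageLogits: [String: Float]) -> (code: String, probability: Float)? {
    guard let best = languageLogits.max(by: { $0.value < $1.value }) else { return nil }
    // Subtract the max logit before exponentiating for numerical stability.
    let expSum = languageLogits.values.reduce(Float(0)) { $0 + exp($1 - best.value) }
    return (code: best.key, probability: 1 / expSum)
}

// Placeholder logits for three language tokens.
let detected = mostLikelyLanguage(from: ["en": 2.0, "de": 0.5, "ja": -1.0])
let runDetection = shouldDetectLanguage(usePrefillPrompt: false, language: nil, isMultilingual: true)
print(runDetection, detected?.code ?? "unknown")
```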
New Contributors
- @Abhinay1997 made their first contribution in #78
Full Changelog: v0.4.1...v0.5.0
v0.4.1
v0.4.0 was our first release on Homebrew, and this will be our first automated update to the formula, huge props to @jkrukowski for his contributions on this.
What's Changed
- Homebrew github action, updated readme by @jkrukowski in #79
- Fix setupModels error handling by @finnvoor in #80
- Use GPU for audio encoder on macOS 13 by @ZachNagengast in #83
- Some of the models were having issues with ANE on macOS 13 / iOS 16, which this resolves by defaulting to GPU on those.
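The OS-dependent default described above can be sketched with Core ML's own `MLComputeUnits` and an availability check. This is an illustration of the idea under stated assumptions, not WhisperKit's actual code: `audioEncoderComputeUnits()` is a hypothetical helper, and the version cutoffs simply follow the macOS 13 / iOS 16 note in this release.

```swift
import CoreML

/// Chooses compute units for the audio encoder based on the running OS.
/// Assumption: the Neural Engine is preferred on macOS 14 / iOS 17 and later,
/// and the GPU is used as the fallback on older OS versions.
func audioEncoderComputeUnits() -> MLComputeUnits {
    if #available(macOS 14.0, iOS 17.0, watchOS 10.0, tvOS 17.0, *) {
        return .cpuAndNeuralEngine
    }
    return .cpuAndGPU
}

// Apply the choice to a standard Core ML model configuration.
let configuration = MLModelConfiguration()
configuration.computeUnits = audioEncoderComputeUnits()
```

The configuration can then be passed to a Core ML model loader in the usual way.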
Full Changelog: v0.4.0...v0.4.1
v0.4.0
Lots of nice fixes in this release!
⚠️ Breaking change
We had to rename the CLI entry point in preparation for Homebrew distribution. Here is how to use it now:
Old:
swift run transcribe --audio-path path/to/your/audio.mp3
New:
swift run whisperkit-cli transcribe --audio-path path/to/your/audio.mp3
What's Changed
- skip functional tests for models that are not downloaded. by @metropol in #48
- Fix crash with mic device sample rate mismatch by @ZachNagengast in #69
- WhisperKit CLI cleanup by @jkrukowski in #68
- Add Progress to WhisperKit by @finnvoor in #71
- Updated swift-transformers and tokenizer changes by @jkrukowski in #72
- Updated swift-transformers, do not use background url session in CLI by @jkrukowski in #74
- Add pre-merge and pre-release tests by @ZachNagengast in #76
Full Changelog: v0.3.3...v0.4.0
v0.3.3
What's Changed
Some great contributions in this patch:
- Expose downloadBase in WhisperKit init by @finnvoor in #57
- Convenience for managing model files
- Add audio device selector to transcribe + take a stab at Delete/Retry models by @cgfarmer4 in #54
- Extends example app functionality
- Issue - 42 WhisperKit support simulator fixed by @bharat9806 in #52
- Fixes a couple of bugs that show up during development on simulators (and fixed a decoding bug in the process, #63).
New Contributors
- @bharat9806 made their first contribution in #52
Full Changelog: v0.3.2...v0.3.3
v0.3.2
What's Changed
- Fixed Conformance of 'Float16' warning by @jkrukowski in #58
- Fix memory leak from non-async MLModel prediction by @finnvoor in #56
With these our build warnings are now down to 0 🎉
Full Changelog: v0.3.1...v0.3.2
v0.3.1
What's Changed
- macOS 13 & iOS 16 support in #40
- We have made WhisperKit available on older OS versions based on community feedback.
- Please note that macOS 13 and iOS 16 performance will be degraded in terms of prediction latency, compile time, and peak memory consumption.
- We have tested and recommend using tiny and base variants on devices with these older OS versions for a stable user experience.
- If you run into any output correctness issues, please switch to using cpuAndGPU compute units (from the default of cpuAndNeuralEngine) via the ModelComputeOptions init parameter.
- As always, if you notice any irregularities, please post an issue here for us to follow up on.
- Implement selecting input device by @cgfarmer4 in #51
- Thanks to @cgfarmer4, macOS users can now select their preferred microphone, not just the default one. Check out @cgfarmer4's fantastic feature walkthrough, and dive into the fully implemented sample code in the WhisperAX example app to see it in action!
New Contributors
- @eltociear made their first contribution in #43
- @cgfarmer4 made their first contribution in #51
Full Changelog: v0.3.0...v0.3.1
v0.3.0
What's Changed
- Word Timestamp support in #38
- You can now generate word level timestamps with the new decoding option wordTimestamps: true or via the CLI with --word-timestamps
- They are included on each TranscriptionSegment in a new words parameter
- Following up with demo code and example app integrations in a later release
- Example json output: https://gist.github.com/ZachNagengast/f36a751bc68a3b5f2c41ada8bcc33746
- Check out this example video from @finnvoor showing it in action:
Detail_202403010956142.mp4
- Allow setting a downloadBase so downloaded models are not forced into the user's Documents folder by @jordibruin in #34
- Streaming Microphone for CLI by @jkrukowski in #35
New Contributors
- @jordibruin made their first contribution in #34
Full Changelog: v0.2.1...v0.3.0
v0.2.1
What's Changed
- Added implementation for SuppressBlankFilter by @jkrukowski in #18
- Also includes a performance improvement for the common LogitFilter operation for filling in -infinity probability.
- Fixed issue with swift package dependencies that point to commit hashes #21 reported by @sleeper
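To illustrate the -infinity fill mentioned above: a logit filter suppresses tokens by setting their logits to negative infinity so they can never be sampled. The sketch below follows the blank-suppression rule from the original OpenAI Whisper decoder, where blank tokens and the end-of-text token are suppressed only at the first sampled position. It is a plain loop for clarity, not the optimized version shipped in this release, and all names and token ids are illustrative.

```swift
import Foundation

/// Sets the logits of the given token ids to -infinity so they cannot be sampled.
func suppress(tokenIds: [Int], in logits: inout [Float]) {
    for id in tokenIds where logits.indices.contains(id) {
        logits[id] = -Float.infinity
    }
}

/// Applies blank suppression only before any token has been sampled.
func suppressBlank(
    logits: inout [Float],
    sampledTokenCount: Int,
    blankTokenIds: [Int],
    endOfTextTokenId: Int
) {
    guard sampledTokenCount == 0 else { return }
    suppress(tokenIds: blankTokenIds + [endOfTextTokenId], in: &logits)
}

// Placeholder vocabulary of four tokens: id 1 is "blank", id 3 is end-of-text.
var logits: [Float] = [0.1, 0.7, 0.2, 0.5]
suppressBlank(logits: &logits, sampledTokenCount: 0, blankTokenIds: [1], endOfTextTokenId: 3)

// Greedy pick after filtering: suppressed tokens can no longer win.
let filtered = logits
let bestIndex = filtered.indices.max(by: { filtered[$0] < filtered[$1] })
print("best token id after filtering: \(bestIndex ?? -1)")
```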
Full Changelog: v0.2.0...v0.2.1