-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix silence detection #175
base: main
Are you sure you want to change the base?
Conversation
… long pause followed by continuous speech
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @aigerimmmm thanks for this submission, this is a hugely needed feature 🎉 Looks like a good start, please see the comments below and lmk what you think. Also on the added files: Is there a specific reason you need the new silent audio, rather than generating silence (all zeros) and appending it to the existing audio? We do that in a few tests you may want to reference. It would also be great to test the specific case where silence is above the threshold but the log probs for the full window are also above the threshold so we keep the tokens rather than skip the window. Thanks again fro working through this!
|
||
// MARK: - Helper Function | ||
|
||
func loadAudioSamples(forResource resource: String, withExtension ext: String) -> [Float] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea 👍
func testSilentAudio() async throws { | ||
let whisperKit = try await WhisperKit(modelFolder: tinyModelPath(), verbose: true, logLevel: .debug) | ||
|
||
let silentAudioSamples: [Float] = loadAudioSamples(forResource: "silent_audio", withExtension: "mp3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be [Float](repeating: 0.0, count: 16000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I've updated the testSilentAudio function to use [Float](repeating: 0.0, count: 16000) for silentAudioSamples.
@@ -157,6 +157,13 @@ final class TranscribeTask { | |||
try Task.checkCancellation() | |||
// Send to decoder to predict text tokens with fallback | |||
let decodingResult = try await decodeWithFallback(encoderSegment: encoderOutput, decodingOptions: options, callback: decodingCallback) | |||
|
|||
// Check for silence detection | |||
if decodingResult.noSpeechProb > (options.noSpeechThreshold ?? 0.7) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should also contain a check for the overall log prob of the window, see reference: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L289
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I've added a check for the overall log probability of the window
if decodingResult.noSpeechProb > (options.noSpeechThreshold ?? 0.7) { | ||
print("Detected silence with noSpeechProb \(decodingResult.noSpeechProb), skipping segment.") | ||
// Skip processing for silent segments | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of break, this should just increase the seek and continue the loop https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L293
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are so right, added increase the seek and continue the loop
@@ -827,7 +892,7 @@ open class TextDecoder: TextDecoding, WhisperMLModel { | |||
let decodingFallback = DecodingFallback( | |||
options: options, | |||
isFirstTokenLogProbTooLow: isFirstTokenLogProbTooLow, | |||
noSpeechProb: noSpeechProb, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended? If I'm reading this right, I think the noSpeechProb should be passed based on the actual probability of the no speech token at the first index of predicted tokens https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L689-L693
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. I have made changes to ensure that noSpeechProb is calculated at the start of the transcript (SOT) token and passed to DecodingFallback. The commit is 3a3b512.
Sources/WhisperKit/Core/Models.swift
Outdated
@@ -317,9 +319,10 @@ public struct DecodingOptions { | |||
compressionRatioThreshold: Float? = 2.4, | |||
logProbThreshold: Float? = -1.0, | |||
firstTokenLogProbThreshold: Float? = -1.5, | |||
noSpeechThreshold: Float? = 0.6, | |||
noSpeechThreshold: Float? = 0.7, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default in openai/whisper is 0.6, can you stick with that value or is there a reason you're updating it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reverted the noSpeechThreshold value back to 0.6 to align with the default used in openai/whisper.
Sources/WhisperKit/Core/Models.swift
Outdated
concurrentWorkerCount: Int = 0, | ||
chunkingStrategy: ChunkingStrategy? = nil | ||
chunkingStrategy: ChunkingStrategy? = nil, | ||
ignorePrefillPromptForNoSpeechDetection: Bool = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit verbose and also not sure exactly what it's intended use is? Are you referencing existing code or was this a need that came up from your testing? I can see a need for something that runs the forward pass for SOT token when using the prefillData, but for prefill prompt we would still want to check the no speech prob specifically at the SOT token, after all the prefill prompt tokens have been passed through to generate their KV caches. Best place to do that would be here:
if tokenIndex < intialPromptIndex { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion! I did check the no speech probability after all prefill prompt tokens have been processed
@@ -278,3 +278,37 @@ open class LanguageLogitsFilter: LogitsFiltering { | |||
return indexes | |||
} | |||
} | |||
|
|||
@available(macOS 13, iOS 16, watchOS 10, visionOS 1, *) | |||
open class SilenceLogitsFilter: LogitsFiltering { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second look at the source, I'm not sure we need a filter for this, since that will suppress the log probs of the "predicted token" during inference. What you already have here https://github.com/argmaxinc/WhisperKit/pull/175/files#diff-5dd6579fc66020b1085535bce41d2c2cc399a0b2b8f0ba225fc89f39d9ebdbc8R402 is checking the specific no speech index, which does everything you would need already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that you could add to the filters is supressing the no speech token in the SupressTokensFilter
similar to this https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/decoding.py#L638-L640 (also needs the remaining suppress tokens but can be fixed later)
Co-authored-by: Zach Nagengast <[email protected]>
Co-authored-by: Zach Nagengast <[email protected]>
Hey @aigerimmmm just checking in, how are you feeling about the changes? |
Hi @ZachNagengast, I really appreciate the all suggestions and detailed review, it's very helpful for me 👍 I've been working on it and I'm going to submit today. |
…ility calculation for detectSilence
…and added test for testDetectSilenceHelperMethod
@aigerimmmm Checking in, is this ready for another review? |
Hi @ZachNagengast thank you for reply, yes I am ready anytime for another review. |
PR for the issue #27.