-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix language detection #133
Fix language detection #133
Conversation
return DecodingResult( | ||
language: detectedLanguage, | ||
languageProbs: [:], | ||
tokens: [], | ||
tokenLogProbs: [], | ||
text: "", | ||
avgLogProb: 0.0, | ||
noSpeechProb: 0.0, | ||
temperature: 0.0, | ||
compressionRatio: 0.0, | ||
cache: nil, | ||
timings: timings, | ||
fallback: nil | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could as well change the interface of this function and return the detected (optional) language code instead of DecodingResult
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable, can it also return the languageProbs
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable, can it also return the
languageProbs
here?
so detected language together with languageProbs instead of DecodingResult? will do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better to keep an empty DecodingResult because it needs timings as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looked, I dont think the timings are being accounted for with this at all, would be nice to merge these in this PR too while you're in the code
WhisperKit/Sources/WhisperKit/Core/TranscribeTask.swift
Lines 273 to 319 in c20943d
if textDecoder.isModelMultilingual, options.language == nil, options.detectLanguage { | |
let languageDecodingResult: DecodingResult? = try? await textDecoder.detectLanguage( | |
from: encoderOutput, | |
using: decoderInputs, | |
sampler: tokenSampler, | |
options: options, | |
temperature: temp | |
) | |
// Update the language decoding options | |
currentDecodingOptions.language = languageDecodingResult?.language | |
detectedLanguage = languageDecodingResult?.language | |
// Update prompt and KV Cache if needed | |
if options.usePrefillPrompt { | |
decoderInputs = try await textDecoder.prefillDecoderInputs(decoderInputs, withOptions: currentDecodingOptions) | |
} | |
Logging.debug("Prefill prompt updated to: \(decoderInputs.initialPrompt.map { tokenizer.convertIdToToken($0) ?? "" })") | |
} | |
decodingResult = try await textDecoder.decodeText( | |
from: encoderOutput, | |
using: decoderInputs, | |
sampler: tokenSampler, | |
options: currentDecodingOptions, | |
callback: callback | |
) | |
// Use the predicted language if it was not detected ahead of time | |
if detectedLanguage == nil { | |
detectedLanguage = decodingResult?.language | |
} | |
// Update timings from the decoder main loop | |
if let decodingTimings = decodingResult?.timings { | |
if timings.firstTokenTime == 0 { | |
timings.firstTokenTime = decodingTimings.firstTokenTime | |
} | |
timings.decodingPredictions += decodingTimings.decodingPredictions | |
timings.totalDecodingLoops += decodingTimings.totalDecodingLoops | |
timings.decodingNonPrediction += decodingTimings.decodingNonPrediction | |
timings.decodingFiltering += decodingTimings.decodingFiltering | |
timings.decodingSampling += decodingTimings.decodingSampling | |
timings.decodingKvCaching += decodingTimings.decodingKvCaching | |
timings.totalKVUpdateRuns += decodingTimings.totalKVUpdateRuns | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done, please take look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this look great - well done 💯
} else { | ||
detectedLanguage = Constants.defaultLanguageCode | ||
Logging.error("Detected language \(sampledLanguage) is not supported, defaulting to \(Constants.defaultLanguageCode)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this error happen often or just a precaution?
func trimmingSpecialTokenCharacters() -> String { | ||
trimmingCharacters(in: Constants.specialTokenCharacters) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
While working on audio chunking I noticed that sometimes language detection is off. It can happen that language is detected as
<|nocaptions|>
which might result in a whole 30s segment being discarded.This PR fixes that by defaulting to "en" when the language detection is confused.