[Feature request] Whisper Language Detection #302
Comments
Hi there! 👋 Definitely a possibility, I'd say! I assume the original (openai) library analyses attention scores across the different language tokens and picks the most likely one. Do you perhaps have example code for how to achieve this with the python library?
Hi,

```python
from typing import Collection, Dict, List, Optional

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer


def detect_language(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, input_features,
                    possible_languages: Optional[Collection[str]] = None) -> List[Dict[str, float]]:
    # hacky, but all language tokens and only language tokens are 6 characters long
    language_tokens = [t for t in tokenizer.additional_special_tokens if len(t) == 6]
    if possible_languages is not None:
        language_tokens = [t for t in language_tokens if t[2:-2] in possible_languages]
        if len(language_tokens) < len(possible_languages):
            raise RuntimeError(f'Some languages in {possible_languages} did not have associated language tokens')

    language_token_ids = tokenizer.convert_tokens_to_ids(language_tokens)

    # 50258 is the <|startoftranscript|> token; the model predicts the language token immediately after it
    logits = model(input_features,
                   decoder_input_ids=torch.tensor([[50258] for _ in range(input_features.shape[0])])).logits

    # mask out every logit that is not a language token, then softmax over what remains
    mask = torch.ones(logits.shape[-1], dtype=torch.bool)
    mask[language_token_ids] = False
    logits[:, :, mask] = -float('inf')

    output_probs = logits.softmax(dim=-1).cpu()
    return [
        {
            lang: output_probs[input_idx, 0, token_id].item()
            for token_id, lang in zip(language_token_ids, language_tokens)
        }
        for input_idx in range(logits.shape[0])
    ]
```
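The core trick in the snippet above is just "mask every non-language logit to −inf, then softmax". A minimal, dependency-free sketch of that step (the logits, vocabulary size, and token positions here are all made up for illustration):

```python
import math

def masked_language_probs(logits, language_token_ids, language_names):
    """Toy illustration of the masking trick: set every logit that is not a
    language token to -inf, then softmax. exp(-inf) == 0.0, so non-language
    tokens contribute nothing and the language probabilities sum to 1."""
    keep = set(language_token_ids)
    masked = [v if i in keep else -math.inf for i, v in enumerate(logits)]
    exps = [math.exp(v) for v in masked]
    total = sum(exps)
    return {name: exps[tid] / total
            for tid, name in zip(language_token_ids, language_names)}

# Hypothetical 6-token vocabulary where positions 1 and 3 are language tokens.
probs = masked_language_probs([2.0, 4.0, 1.0, 3.0, 0.5, 2.5], [1, 3], ['<|en|>', '<|fr|>'])
```

In the real function the same thing happens batched in torch: `logits[:, :, mask] = -float('inf')` followed by `logits.softmax(dim=-1)`.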
Oh, that looks much simpler than what I was expecting. I might need to make some modifications to the forward function. Could you provide some example input and output so that I can make sure my implementation matches your example?
Is it possible to simply output the detected language as an event from the main pipeline? The following sample output is different from the output of the previous code, but I believe it would be sufficient, since it provides the probability of each language:

`{"<LANG1>": 0.9, "<LANG2>": 0.05, "<LANG3>": 0.05}`

Another important point to consider is whether to pass only a single chunk of audio to this function (to make it faster), or to process every chunk and return the language probabilities along with each `chunk_callback`. What are your thoughts on this structure? Is it possible?
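The per-chunk dictionary shape proposed above is easy to consume downstream. As a sketch (with made-up token names and probabilities, not actual pipeline output), picking the most likely language for each chunk would look like:

```python
def top_language(lang_probs):
    """Return the most likely language token and its probability from a
    {language_token: probability} dict like the proposed sample output."""
    lang = max(lang_probs, key=lang_probs.get)
    return lang, lang_probs[lang]

# Hypothetical per-chunk probability dicts in the proposed output shape.
chunks = [
    {"<|en|>": 0.9, "<|fr|>": 0.05, "<|de|>": 0.05},
    {"<|en|>": 0.2, "<|fr|>": 0.7, "<|de|>": 0.1},
]
detected = [top_language(c) for c in chunks]
```

Emitting one such dict per chunk would also let callers decide themselves whether to trust the first chunk's detection or aggregate across chunks.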
Name of the feature
Language Detection with the Whisper Model in Transformers.js
Reason for request
The original Whisper model includes a dedicated function for language detection. It would be awesome to have a similar capability within Transformers.js. In my current application, Whisper models run efficiently in the browser; however, for language detection I find myself repeatedly making requests to my backend service.
Additional context
Is it possible to implement language detection as an event returned from the pipeline?