
[Feature request] Whisper Language Detection #302

Open
FelippeChemello opened this issue Sep 14, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@FelippeChemello

Name of the feature
Language Detection with the Whisper Model in Transformers.js

Reason for request
The original Whisper model includes a dedicated function for language detection. It would be awesome to have a similar capability in Transformers.js. In my current application, Whisper models run efficiently in the browser, but for language detection I find myself repeatedly making requests to my backend service.

Additional context
Is it possible to implement language detection as an event returned from the pipeline?

FelippeChemello added the enhancement New feature or request label Sep 14, 2023
@xenova
Collaborator

xenova commented Sep 15, 2023

Hi there! 👋 Definitely a possibility I'd say! I assume the original (openai) library inspects the logits for the different language tokens and picks the most likely one. Do you perhaps have example code for how to achieve this with the python transformers library?

@FelippeChemello
Author

Hi,
I found this code in a discussion thread on the Hugging Face forums. It's not part of the default implementation of the transformers library, but perhaps it could be added to the library to make this information easier to obtain.

from typing import Collection, Dict, List, Optional

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer


def detect_language(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, input_features,
                    possible_languages: Optional[Collection[str]] = None) -> List[Dict[str, float]]:
    # hacky, but all language tokens and only language tokens are 6 characters long
    language_tokens = [t for t in tokenizer.additional_special_tokens if len(t) == 6]
    if possible_languages is not None:
        language_tokens = [t for t in language_tokens if t[2:-2] in possible_languages]
        if len(language_tokens) < len(possible_languages):
            raise RuntimeError(f'Some languages in {possible_languages} did not have associated language tokens')

    language_token_ids = tokenizer.convert_tokens_to_ids(language_tokens)

    # 50258 is <|startoftranscript|>; the model predicts the language token immediately after it
    decoder_input_ids = torch.tensor([[50258] for _ in range(input_features.shape[0])])
    with torch.no_grad():
        logits = model(input_features, decoder_input_ids=decoder_input_ids).logits

    # mask out everything except the language tokens before taking the softmax
    mask = torch.ones(logits.shape[-1], dtype=torch.bool)
    mask[language_token_ids] = False
    logits[:, :, mask] = -float('inf')

    output_probs = logits.softmax(dim=-1).cpu()
    return [
        {
            lang: output_probs[input_idx, 0, token_id].item()
            for token_id, lang in zip(language_token_ids, language_tokens)
        }
        for input_idx in range(logits.shape[0])
    ]
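
Here's a minimal usage sketch to go with it (the openai/whisper-tiny checkpoint and the silent placeholder waveform are just assumptions for illustration):

import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# placeholder: one second of silence at 16 kHz -- replace with real audio
waveform = np.zeros(16000, dtype=np.float32)
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features

probs = detect_language(model, processor.tokenizer, input_features)
print(max(probs[0], key=probs[0].get))  # most likely language token, e.g. "<|en|>"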

@xenova
Collaborator

xenova commented Sep 15, 2023

Oh, that looks much simpler than I was expecting. I might need to make some modifications to the forward function of WhisperForConditionalGeneration in transformers.js, but the majority of the functionality needed is already in place.

Could you provide some example input and output so that I can make sure my implementation matches your example?

@FelippeChemello
Author

FelippeChemello commented Sep 15, 2023

Is it possible to simply output the detected language as an event from the main pipeline?

The following output sample is different from the output of the previous code, but I believe it would be sufficient since it provides the probability of each language:

{"<LANG1>": 0.9, "<LANG2>": 0.05, "<LANG3>": 0.05}

Furthermore, an important point to consider is whether to pass only a single chunk of audio to this function (to keep it fast) or to run it on every chunk and return the language probabilities along with each chunk_callback.

What are your thoughts on this structure? Is it possible?
