
[Feature request] Whisper Language Detection #302

Open
FelippeChemello opened this issue Sep 14, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@FelippeChemello

Name of the feature
Language Detection with the Whisper Model in Transformers.js

Reason for request
The original Whisper model includes a dedicated function for language detection. It would be awesome to have a similar capability in Transformers.js. In my current application, Whisper models run efficiently in the browser, but for language detection I find myself repeatedly making requests to my backend service.

Additional context
Is it possible to implement language detection as an event returned from the pipeline?

FelippeChemello added the enhancement New feature or request label Sep 14, 2023
@xenova
Collaborator

xenova commented Sep 15, 2023

Hi there! 👋 Definitely a possibility I'd say! I assume the original (openai) library inspects the logits for the different language tokens and picks the most likely one. Do you perhaps have example code for how to achieve this with the python transformers library?

@FelippeChemello
Author

Hi,
I found this code in a discussion thread on the Hugging Face forums. It's not part of the default implementation of the transformers library, but perhaps it could be added to the library to make this information easier to obtain.

from typing import Collection, Dict, List, Optional

import torch
from transformers import WhisperForConditionalGeneration, WhisperTokenizer


def detect_language(model: WhisperForConditionalGeneration, tokenizer: WhisperTokenizer, input_features,
                    possible_languages: Optional[Collection[str]] = None) -> List[Dict[str, float]]:
    # hacky, but all language tokens and only language tokens are 6 characters long
    language_tokens = [t for t in tokenizer.additional_special_tokens if len(t) == 6]
    if possible_languages is not None:
        language_tokens = [t for t in language_tokens if t[2:-2] in possible_languages]
        if len(language_tokens) < len(possible_languages):
            raise RuntimeError(f'Some languages in {possible_languages} did not have associated language tokens')

    language_token_ids = tokenizer.convert_tokens_to_ids(language_tokens)

    # 50258 is <|startoftranscript|>; the model predicts the language token immediately after it
    decoder_input_ids = torch.tensor([[50258] for _ in range(input_features.shape[0])])
    with torch.no_grad():
        logits = model(input_features, decoder_input_ids=decoder_input_ids).logits

    # mask out everything except the language tokens before taking the softmax
    mask = torch.ones(logits.shape[-1], dtype=torch.bool)
    mask[language_token_ids] = False
    logits[:, :, mask] = -float('inf')

    output_probs = logits.softmax(dim=-1).cpu()
    return [
        {
            lang: output_probs[input_idx, 0, token_id].item()
            for token_id, lang in zip(language_token_ids, language_tokens)
        }
        for input_idx in range(logits.shape[0])
    ]
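
Here's a minimal usage sketch to go with it (the openai/whisper-tiny checkpoint and the silent placeholder waveform are just assumptions for illustration):

import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# placeholder: one second of silence at 16 kHz -- replace with real audio
waveform = np.zeros(16000, dtype=np.float32)
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features

probs = detect_language(model, processor.tokenizer, input_features)
print(max(probs[0], key=probs[0].get))  # most likely language token, e.g. "<|en|>"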

@xenova
Collaborator

xenova commented Sep 15, 2023

Oh, that looks much simpler than I was expecting. I might need to make some modifications to the forward function of WhisperForConditionalGeneration in transformers.js, but the majority of the functionality needed is already in place.

Could you provide some example input and output so that I can make sure my implementation matches your example?

@FelippeChemello
Author

FelippeChemello commented Sep 15, 2023

Is it possible to simply output the detected language as an event from the main pipeline?

The following output sample is different from the output of the previous code, but I believe it would be sufficient since it provides the probability of each language:

{"<LANG1>": 0.9, "<LANG2>": 0.05, "<LANG3>": 0.05}

Furthermore, an important point to consider is whether to pass only a single chunk of audio to this function (to keep it fast) or to run it on every chunk and return the language probabilities along with each chunk_callback.

What are your thoughts on this structure? Is it possible?
