Conversation

@zuazo (Contributor) commented Oct 16, 2023

What does this PR do?

This PR introduces a new script, convert_hf_to_openai.py, that converts Hugging Face Whisper models back to the original OpenAI Whisper format. It is simply the inverse of the existing convert_openai_to_hf.py script.
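For illustration only, here is a hypothetical round-trip check; the CLI flags and output path below are assumptions for this sketch, not the script's final interface:

```python
# Hypothetical invocation (flag names are illustrative assumptions):
#   python convert_hf_to_openai.py \
#       --checkpoint openai/whisper-tiny \
#       --whisper_dump_path whisper-tiny-openai.pt
#
# The converted checkpoint can then be loaded with the original library:
import whisper  # pip install openai-whisper

model = whisper.load_model("whisper-tiny-openai.pt")  # assumed dump path from above
result = model.transcribe("audio.wav")
print(result["text"])
```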

While Hugging Face is easier to use for tasks such as fine-tuning and offers many integrations, the original OpenAI Whisper library provides more fine-grained control over this specific model, which makes it easier to test new approaches and certain algorithms (at least in our case).

Doctests

I added a doctest at the beginning that passes, but it requires the openai-whisper package to be installed, so I left it disabled by using a double >> instead of the usual >>> prompt. I am not sure how you prefer to handle this case: leave it as is, add the Whisper package somewhere in the CI (e.g. .github/workflows/doctests.yml), or handle it in some other way.
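As a minimal sketch of what I mean by the disabled doctest (the function name here is made up for illustration), the `>>` prefix keeps the doctest runner from collecting the example while still documenting the usage:

```python
def convert_hf_to_openai(checkpoint: str, whisper_dump_path: str):
    """Convert a Hugging Face Whisper checkpoint to the OpenAI Whisper format.

    Example (disabled: the ``>>`` prefix instead of ``>>>`` means doctest skips it,
    so the optional openai-whisper dependency is not required in CI):

        >> convert_hf_to_openai("openai/whisper-tiny", "whisper-tiny-openai.pt")
    """
    ...
```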

Also, even though the original convert_openai_to_hf.py script does not have tests, let me know if you would like me to add some here. I have verified myself that the conversion works with all the Whisper model sizes, including Large V2.

Before submitting

Who can review?

Possible candidates:

@ArthurZucker (Collaborator) commented:

Linking #20953, as this was requested quite a while ago. We don't usually add these to transformers and would rather add it to the ## Resources section as a link to your repo with the script. WDYT @sanchit-gandhi?

@sanchit-gandhi (Contributor) commented:

The Resources section would probably be best here, @zuazo! It becomes difficult to maintain Transformers if we become a one-to-many export library (i.e. exporting the Transformers format to any number of other libraries).

I'm curious to hear which parameters you need from the OpenAI implementation that we don't offer in Transformers, though! We can certainly discuss adding them to Transformers on GitHub to improve the experience for you. Currently, we're a lot faster than OpenAI: https://twitter.com/reach_vb/status/1714741554030481786

@zuazo (Contributor, Author) commented Oct 25, 2023

Absolutely, that sounds reasonable. I will open a new PR to add it to the ## Resources section once we finish PR #26834, to avoid any merge conflicts.

Regarding our usage, we have experimented with coupling HF fine-tuned Whisper models with n-gram LMs. This seemed straightforward in the whisper library thanks to its existing BeamSearchDecoder, which makes it simple to incorporate a KenLM model.
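As a rough illustration of the shallow fusion we mean (not our actual code; the LM path, weighting, and scoring function are assumptions for this sketch), the idea is to add a weighted KenLM score to the decoder's cumulative log-probability for each beam hypothesis:

```python
import kenlm  # https://github.com/kpu/kenlm

lm = kenlm.Model("lm.arpa")  # assumed path to a trained n-gram LM

def fused_score(decoder_logprob: float, hypothesis_text: str, lm_weight: float = 0.5) -> float:
    """Combine the Whisper decoder's cumulative log-probability with an n-gram LM score."""
    # kenlm.Model.score returns a log10 probability; units must be made consistent
    # with the decoder's scores in real code.
    lm_logprob = lm.score(hypothesis_text, bos=True, eos=False)
    return decoder_logprob + lm_weight * lm_logprob
```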

If there is a similar feature in Transformers that I overlooked, I apologize. Navigating through such a comprehensive library can sometimes be challenging.

@sanchit-gandhi (Contributor) commented:

Interesting use case! Did you find that the Whisper decoder was not enough to predict accurate spellings/transcriptions? The Whisper decoder is in effect an "internal" LM, since it plays the role of generating the text conditional on the encoder hidden states. Is your n-gram LM trained with the same vocab size as Whisper, i.e. do you use the Whisper logits in combination with the n-gram model to get your final transcription? We have something similar to this with Wav2Vec2 + n-gram here:
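(A minimal sketch of the Wav2Vec2 + n-gram setup in Transformers, using Wav2Vec2ProcessorWithLM; the checkpoint below is one public example that bundles a KenLM model, not necessarily the one meant above:)

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "patrickvonplaten/wav2vec2-base-100h-with-lm"  # example checkpoint with a bundled n-gram LM
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]
inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs a pyctcdecode beam search that fuses the CTC logits with the KenLM model.
transcription = processor.batch_decode(logits.numpy()).text
print(transcription[0])
```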
