
Conversation

@ArthurZucker
Collaborator

What does this PR do?

The goal of this PR is to allow users to do the following:

...
whisper_model.generate(audio, return_timestamps=True)
whisper_model.generate(audio, return_timestamps=True, task="transcribe")

The language is automatically detected. This also simplifies the pipeline calls and adds a good example of `generation_config`'s intended usage.

Comment on lines 1251 to 1267
# priority: `generation_config` argument > `model.generation_config` (the default generation config)
if generation_config is None:
    # legacy: users may modify the model configuration to control generation -- update the
    # generation config model attribute accordingly, if it was created from the model config
    if self.generation_config._from_model_config:
        new_generation_config = GenerationConfig.from_model_config(self.config)
        if new_generation_config != self.generation_config:
            warnings.warn(
                "You have modified the pretrained model configuration to control generation. This is a"
                " deprecated strategy to control generation and will be removed soon, in a future version."
                " Please use a generation configuration file (see"
                " https://huggingface.co/docs/transformers/main_classes/text_generation)"
            )
            self.generation_config = new_generation_config
    generation_config = self.generation_config

I don't agree with this warning: the generation config can be different but the rest of the model is the same.
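The resolution order described in that snippet can be illustrated with a small self-contained sketch. This is a toy stand-in, not the transformers implementation; the class and config names are made up:

```python
# Toy illustration of the priority rule:
# an explicit `generation_config` argument wins over the model-level default.
class ToyModel:
    def __init__(self, default_config):
        # stands in for `model.generation_config`
        self.generation_config = default_config

    def resolve_config(self, generation_config=None):
        # priority: `generation_config` argument > `model.generation_config`
        if generation_config is None:
            generation_config = self.generation_config
        return generation_config

model = ToyModel({"max_length": 448})
print(model.resolve_config())                    # falls back to the default
print(model.resolve_config({"max_length": 32}))  # the explicit argument wins
```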
Collaborator Author

This part is redundant with `super().generate()`. It would be good if `self.generation_config` were already created and only required an update at this point.
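A minimal sketch of that suggestion, assuming a simplified parent class (these names and signatures are hypothetical, not the actual transformers API): the subclass only updates the already-materialised generation config, then defers to the parent's `generate()`.

```python
# Hypothetical parent: resolves the config and runs generation.
class GenerationMixin:
    def generate(self, inputs, generation_config=None):
        config = generation_config or self.generation_config
        return {"inputs": inputs, "config": config}

class WhisperStyleModel(GenerationMixin):
    # the default config already exists on the class
    generation_config = {"return_timestamps": False}

    def generate(self, inputs, return_timestamps=False, **kwargs):
        # only update the existing config; the parent handles the rest
        config = {**self.generation_config, "return_timestamps": return_timestamps}
        return super().generate(inputs, generation_config=config, **kwargs)
```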

@ArthurZucker ArthurZucker requested a review from Narsil January 23, 2023 10:54
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jan 23, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor

@Narsil Narsil left a comment

I'll delay the full review but gave some early comments.

The code is indeed cleaner that way!

 # apply the `max_initial_timestamp` option
-if input_ids.shape[1] == self.begin_index and self.max_initial_timestamp_index is not None:
-    last_allowed = self.timestamp_begin + self.max_initial_timestamp_index
+if input_ids.shape[1] == self.begin_index and self.max_initial_timestamp_idx is not None:
Contributor

nit: I'm under the impression Sylvain would favor `index` over `idx` (and I agree).

out = {"tokens": tokens}
if stride is not None:
    out["stride"] = stride
if self.type == "seq2seq_whisper":
Contributor

We needed to pop before generate.

If there's no need to pop beforehand, we can simplify by setting something like:

out = {**out, **model_inputs}, or something along those lines that includes only stride.
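One hedged sketch of that simplification, with illustrative names rather than the actual pipeline code: merge only the keys we care about (here just `"stride"`) from the model inputs into the output dict, instead of popping them beforehand.

```python
# Illustrative sketch: pull only "stride" out of model_inputs and merge it
# into the output dict, leaving the other model inputs untouched.
def build_output(tokens, model_inputs):
    extras = {k: v for k, v in model_inputs.items() if k == "stride"}
    return {"tokens": tokens, **extras}

out = build_output([1, 2, 3], {"stride": (4000, 0, 200), "input_features": "..."})
# → {"tokens": [1, 2, 3], "stride": (4000, 0, 200)}
```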

@bjelkenhed

bjelkenhed commented Jan 23, 2023

"The language is automatically detected": in my experience, the language detection by Whisper is very unreliable. Will it still be possible to specify the language?

@ArthurZucker
Collaborator Author

Sure, let's make sure we still allow the language to be passed! Thanks for pointing this out.

@ArthurZucker ArthurZucker self-assigned this Jan 23, 2023
@ArthurZucker
Collaborator Author

Once #21257 is merged, the tests here should also pass!

Contributor

@Narsil Narsil left a comment

This is super nice!
LGTM.

It does clean up quite nicely, IMO.

@ArthurZucker
Collaborator Author

Pipeline tests need #21269 to be merged 😉

@ArthurZucker ArthurZucker requested a review from sgugger January 24, 2023 18:25
Collaborator

@sgugger sgugger left a comment

LGTM apart from the doc. Thanks!

@ArthurZucker
Collaborator Author

The two failing tests are from the latest modification of the multilingual tokenizer's config.


forced_decoder_ids = []

if hasattr(generation_config, "is_multilingual") and generation_config.is_multilingual:
Contributor

This is where we first introduced the generation config. Unless the task and language were passed as inputs, we'd default to speech transcription with language detection.
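A rough sketch of that default behaviour, with made-up token ids and a hypothetical helper (this is not the real Whisper vocabulary or API): when no language is passed, the language slot is left empty so detection kicks in, and the task defaults to transcription.

```python
# Hypothetical sketch: build `forced_decoder_ids` from explicit language/task
# arguments. Token ids below are illustrative, not real Whisper token ids.
LANG_TOKENS = {"en": 50259, "sv": 50273}
TASK_TOKENS = {"transcribe": 50359, "translate": 50358}

def build_forced_decoder_ids(language=None, task="transcribe"):
    forced_decoder_ids = []
    if language is not None:
        # position 1 holds the language token; omitting it triggers detection
        forced_decoder_ids.append((1, LANG_TOKENS[language]))
    # position 2 holds the task token; transcription is the default
    forced_decoder_ids.append((2, TASK_TOKENS[task]))
    return forced_decoder_ids
```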

@ArthurZucker ArthurZucker deleted the refactor-whisper branch January 30, 2024 09:26


Development

Successfully merging this pull request may close these issues.

[Whisper] ASR Pipeline with "return_timestamps=True" gives IndexError: index -1 is out of bounds for axis 0 with size 0

6 participants