[ci-daily] Fix pipeline tests #21257
Conversation
The documentation is not available anymore as the PR was closed or merged.
```diff
- word_offsets = []
+ offsets = []
  for word, (start_offset, end_offset) in chunk_offset:
-     word_offsets.append({"word": word, "start_offset": start_offset, "end_offset": end_offset})
+     offsets.append({"word": word, "start_offset": start_offset, "end_offset": end_offset})
```
Normalized the name of this variable.
```diff
  yield {"is_last": True, **processed, **extra}

- def _forward(self, model_inputs, generate_kwargs=None):
+ def _forward(self, model_inputs, return_timestamps=False, generate_kwargs=None):
```
Adding this argument prevents the `.pop` from removing it for other processes.
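A minimal sketch of why this matters (hypothetical function names, not the actual pipeline code): when the flag is `.pop`ped out of a shared kwargs dict, the first chunk consumes it and every later chunk sees it as absent. Passing it as an explicit argument leaves the shared dict untouched.

```python
def forward_with_pop(model_inputs, generate_kwargs):
    # Problematic: the first call removes the key from the shared dict,
    # so subsequent calls fall back to the default.
    return generate_kwargs.pop("return_timestamps", False)

def forward_with_arg(model_inputs, return_timestamps=False, generate_kwargs=None):
    # Safe: the flag travels as its own argument; generate_kwargs stays intact.
    return return_timestamps

shared_kwargs = {"return_timestamps": True}
chunks = ["chunk0", "chunk1"]

popped = [forward_with_pop(c, shared_kwargs) for c in chunks]
print(popped)  # [True, False] -- the second chunk lost the flag

explicit = [forward_with_arg(c, return_timestamps=True) for c in chunks]
print(explicit)  # [True, True]
```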
```diff
- consecutive = np.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0] + 1
- last_timestamp = np.where(timestamp_tokens)[0][-1]
- consecutive = np.append(consecutive, last_timestamp) if last_timestamp not in consecutive else consecutive
- if seq_idx != 0:
+ if seq_idx != 0 and sum(timestamp_tokens) > 0:
+     consecutive = np.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0] + 1
+     last_timestamp = np.where(timestamp_tokens)[0][-1]
+     consecutive = np.append(consecutive, last_timestamp) if last_timestamp not in consecutive else consecutive
```
This just makes sure that if the model outputs no timestamps, we don't throw an error.
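A standalone sketch of the guard (with made-up token arrays, not real model output): if the sequence contains no timestamp tokens, `np.where(timestamp_tokens)[0]` is empty and indexing it with `[-1]` would raise an `IndexError`, which is exactly what the `sum(timestamp_tokens) > 0` check avoids.

```python
import numpy as np

# Case 1: no timestamp tokens -- the guard skips the whole branch.
timestamp_tokens = np.array([0, 0, 0, 0], dtype=bool)
if timestamp_tokens.sum() > 0:
    last_timestamp = np.where(timestamp_tokens)[0][-1]  # would raise IndexError here
else:
    consecutive = np.array([], dtype=int)  # nothing to split on

# Case 2: timestamps present -- adjacent pairs of timestamp tokens
# mark split points between segments.
timestamp_tokens = np.array([0, 1, 1, 0, 1], dtype=bool)
consecutive = np.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0] + 1
last_timestamp = np.where(timestamp_tokens)[0][-1]
if last_timestamp not in consecutive:
    consecutive = np.append(consecutive, last_timestamp)

print(consecutive)  # [2 4]
```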
```diff
+ if "input_features" in processed:
+     processed_len = processed["input_features"].shape[-1]
+ elif "input_values" in processed:
+     processed_len = processed["input_values"].shape[-1]
+ if processed_len != chunk.shape[-1] and rescale:
+     ratio = processed_len / chunk_len
```
This was missing! Fixes the LM tests
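A rough sketch of the idea with hypothetical numbers (the shapes and stride values below are assumptions, not taken from the tests): strides are expressed in raw audio samples, but the feature extractor may change the sequence length, so stride values have to be rescaled by the ratio between the processed length and the raw chunk length.

```python
chunk_len = 16000                        # raw audio samples in this chunk (assumed)
processed_len = 1000                     # feature frames after extraction (assumed)
stride_left, stride_right = 1600, 1600   # overlap expressed in raw samples

if processed_len != chunk_len:
    # Rescale sample-based strides into the feature-frame domain.
    ratio = processed_len / chunk_len
    stride = (processed_len, int(round(stride_left * ratio)), int(round(stride_right * ratio)))
else:
    stride = (chunk_len, stride_left, stride_right)

print(stride)  # (1000, 100, 100)
```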
sgugger left a comment:
LGTM if @Narsil agrees :-)
What does this PR do?
Should fix the `automatic_speech_recognition` pipeline tests. Also using a `streaming` dataset to speed up the tests. Think it is a good idea if we are only using 1 sample.