@ArthurZucker
Collaborator

What does this PR do?

Should fix the automatic_speech_recognition_pipeline tests.
Also switches to a streaming dataset to speed up the tests. I think this is a good idea when we only use a single sample.
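
A minimal sketch of why streaming helps when a test needs only one sample: the helper below pulls the first item from any iterable and never touches the rest. The name `first_samples` and the fake generator are illustrative assumptions, not code from this PR; with the `datasets` library the iterable would typically come from `load_dataset(..., streaming=True)`.

```python
from itertools import islice


def first_samples(dataset, n=1):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(dataset, n))


def fake_streaming_dataset(produced):
    # Hypothetical stand-in for a streamed split; it records every sample it generates.
    for i in range(1_000_000):
        produced.append(i)
        yield {"id": i, "audio": [0.0] * 4}


produced = []
samples = first_samples(fake_streaming_dataset(produced), n=1)
```

Only the samples that are actually requested get generated, which is the property the tests rely on when they read a single example.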

@HuggingFaceDocBuilderDev
HuggingFaceDocBuilderDev commented Jan 23, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker changed the title from "use streaming dataset" to "[ci-daily] Fix pipeline tests" Jan 23, 2023
@ArthurZucker ArthurZucker marked this pull request as ready for review January 23, 2023 17:41
Comment on lines -638 to +645
- word_offsets = []
+ offsets = []
  for word, (start_offset, end_offset) in chunk_offset:
-     word_offsets.append({"word": word, "start_offset": start_offset, "end_offset": end_offset})
+     offsets.append({"word": word, "start_offset": start_offset, "end_offset": end_offset})
Collaborator Author

Normalized the name of this variable (`word_offsets` is now `offsets`).

yield {"is_last": True, **processed, **extra}

- def _forward(self, model_inputs, generate_kwargs=None):
+ def _forward(self, model_inputs, return_timestamps=False, generate_kwargs=None):
Collaborator Author

Adding this argument prevents the `.pop` from removing it for other processes.
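
One way to read this: if the flag lives only in a shared kwargs dict, the first `.pop` removes it for every later call. Below is a minimal sketch under that assumption; the function names are hypothetical and are not the pipeline's real methods.

```python
def forward_with_pop(model_inputs, generate_kwargs):
    # Mutates the caller's dict: the key is gone after the first call.
    return generate_kwargs.pop("return_timestamps", False)


def forward_with_arg(model_inputs, return_timestamps=False, generate_kwargs=None):
    # The flag arrives as its own parameter, so nothing shared is mutated.
    generate_kwargs = dict(generate_kwargs or {})
    return return_timestamps


shared_kwargs = {"return_timestamps": True}
pop_first = forward_with_pop({}, shared_kwargs)
pop_second = forward_with_pop({}, shared_kwargs)  # flag silently lost

arg_first = forward_with_arg({}, return_timestamps=True)
arg_second = forward_with_arg({}, return_timestamps=True)
```

With the explicit parameter, every call sees the same value no matter how many times the function runs.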

Comment on lines -104 to +113
- consecutive = np.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0] + 1
- last_timestamp = np.where(timestamp_tokens)[0][-1]
- consecutive = np.append(consecutive, last_timestamp) if last_timestamp not in consecutive else consecutive
- if seq_idx != 0:
+ if seq_idx != 0 and sum(timestamp_tokens) > 0:
+     consecutive = np.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0] + 1
+     last_timestamp = np.where(timestamp_tokens)[0][-1]
+     consecutive = np.append(consecutive, last_timestamp) if last_timestamp not in consecutive else consecutive
Collaborator Author

This just makes sure that we don't throw an error when the model outputs no timestamps.
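
For context, `np.where(mask)[0][-1]` raises an `IndexError` when the mask has no `True` entries, which is the case the new guard avoids. A minimal standalone sketch follows; the function name is hypothetical, and it returns an empty array where the pipeline simply skips the block.

```python
import numpy as np


def consecutive_timestamp_positions(timestamp_tokens):
    """Positions that close a run of adjacent timestamp tokens, plus the last timestamp."""
    timestamp_tokens = np.asarray(timestamp_tokens, dtype=bool)
    if timestamp_tokens.sum() == 0:
        # No timestamps at all: indexing [-1] below would raise IndexError.
        return np.array([], dtype=int)
    consecutive = np.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0] + 1
    last_timestamp = np.where(timestamp_tokens)[0][-1]
    if last_timestamp not in consecutive:
        consecutive = np.append(consecutive, last_timestamp)
    return consecutive


with_timestamps = consecutive_timestamp_positions([False, True, True, False, True])
without_timestamps = consecutive_timestamp_positions([False, False, False])
```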

Comment on lines +74 to +79
+ if "input_features" in processed:
+     processed_len = processed["input_features"].shape[-1]
+ elif "input_values" in processed:
+     processed_len = processed["input_values"].shape[-1]
+ if processed_len != chunk.shape[-1] and rescale:
+     ratio = processed_len / chunk_len
Collaborator Author

This was missing! Fixes the LM tests
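
To illustrate what the ratio is for: when the feature extractor changes the length of the last axis (for example, raw samples become spectrogram frames), the stride expressed in samples has to be converted into the same units. The sketch below is an assumption about how such a rescaling could look; `rescale_stride_sketch` and the numbers are illustrative, not taken from this PR.

```python
def rescale_stride_sketch(stride, ratio):
    """Convert a (length, left, right) stride from input units to processed units."""
    input_n, left, right = stride
    token_n = int(round(input_n * ratio))
    # Keep the left/right overlap proportional to the rescaled length.
    new_left = int(round(left / input_n * token_n))
    new_right = int(round(right / input_n * token_n))
    return (token_n, new_left, new_right)


chunk_len = 480_000      # hypothetical chunk length in audio samples
processed_len = 3_000    # hypothetical length after feature extraction
ratio = processed_len / chunk_len

rescaled = rescale_stride_sketch((chunk_len, 80_000, 80_000), ratio)
```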

Collaborator

@sgugger sgugger left a comment

LGTM if @Narsil agrees :-)

@ArthurZucker ArthurZucker merged commit b80b221 into huggingface:main Jan 23, 2023