[workflow] Word-level forced alignment with pretrained models from Torchaudio #827

Merged
merged 17 commits into from
Sep 29, 2022

@pzelasko pzelasko commented Sep 28, 2022

As a follow-up to #824, I'm adding another workflow for forced word-level alignment. It may be interesting to combine the two workflows. I ended up using the torchaudio model because the tutorial example looks quite convincing with regard to word timestamp accuracy, but I did not evaluate whether it's better or worse than any other approach. We can add other forced-alignment workflows later to offer some choice.

@desh2608 I also refactored AlignmentItem a bit so that it's a NamedTuple again, for the sake of efficiency. After using it for some time, though, I don't like this approach. I think we should eventually change it to something like a numpy array (or a collection of them) and maybe not store the alignments in the manifest. Anyway, at least the manifests are much smaller with the NamedTuple approach.
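
A rough illustration of the size argument, not taken from the PR diff: the `Item` class below is a hypothetical stand-in with the same fields as the reviewed snippet, and serializing it as a plain JSON list (rather than a dict with repeated keys) is an assumption about how a tuple-like item could be stored.

```python
import json
from typing import NamedTuple, Optional


class Item(NamedTuple):
    """Hypothetical stand-in for an alignment item (not the lhotse class)."""
    symbol: str
    start: float
    duration: float
    score: Optional[float] = None


items = [Item("hello", 0.0, 0.5, 0.9), Item("world", 0.5, 0.4, 0.8)]

# Dict form repeats the field names for every single item.
as_dicts = json.dumps([it._asdict() for it in items])
# Tuple/list form stores only the values; field order carries the meaning.
as_lists = json.dumps([list(it) for it in items])

# Restoring an item from the compact form relies on positional order.
restored = Item(*json.loads(as_lists)[0])
```

The compact form saves the per-item key names, at the cost of depending on field order when reading the data back.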

@pzelasko pzelasko added this to the v1.8 milestone Sep 28, 2022
@pzelasko pzelasko requested a review from desh2608 September 28, 2022 19:15
desh2608 previously approved these changes Sep 28, 2022

@desh2608 desh2608 left a comment

Super-fast development on this one! LGTM, in general.

Regarding alignments, I agree it's not very efficient. Perhaps treating them in the same way as Features would be a better option.

@desh2608

Here's one approach off the top of my head. We can have an AlignmentObject that contains 2 things:

  1. A dict which is just a mapping from a symbol (e.g., word, phone) to an int.
  2. A path to a numpy file, containing rows of the form (symbol_idx, start_time, end_time, score).
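
The two-part idea above could be sketched roughly as follows. This is only an illustration under assumptions: the names `AlignmentObject`, `save_alignment`, `symbol_table`, and `array_path` are hypothetical, nothing here exists in lhotse, and storing the symbol index as a float in the same array as the times is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

import numpy as np

# (symbol, start_time, end_time, score)
Row = Tuple[str, float, float, float]


@dataclass
class AlignmentObject:
    """Hypothetical container: a symbol table plus a path to a .npy file."""
    symbol_table: Dict[str, int]  # symbol (word, phone, ...) -> int
    array_path: str               # rows: (symbol_idx, start, end, score)

    def load(self) -> List[Row]:
        idx_to_symbol = {i: s for s, i in self.symbol_table.items()}
        rows = np.load(self.array_path)
        return [
            (idx_to_symbol[int(r[0])], float(r[1]), float(r[2]), float(r[3]))
            for r in rows
        ]


def save_alignment(items: Iterable[Row], path: str) -> AlignmentObject:
    """Write rows to `path` (should end with .npy) and return the object."""
    symbol_table: Dict[str, int] = {}
    rows = []
    for symbol, start, end, score in items:
        # Assign the next free integer the first time a symbol is seen.
        idx = symbol_table.setdefault(symbol, len(symbol_table))
        rows.append((idx, start, end, score))
    np.save(path, np.asarray(rows, dtype=np.float64))
    return AlignmentObject(symbol_table=symbol_table, array_path=str(path))
```

With this layout the manifest would only need to carry the small symbol table and a path, while the per-item numbers live in the array file.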

@pzelasko pzelasko marked this pull request as ready for review September 29, 2022 13:44
lhotse/bin/modes/workflows.py (outdated, resolved)
"""

symbol: str
start: Seconds
duration: Seconds
score: Optional[float] = None
Contributor

What does the score indicate?

Collaborator Author

It's an aligner-specific "confidence score". In the case of torchaudio, their tutorial suggests using the average token probability over a given segment, but it could also be computed differently. I'll add a proper note.
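
As a minimal sketch of one way such a score could be computed: the `TokenSpan` type and `word_score` helper below are hypothetical, and weighting each token's probability by the number of frames it spans is an assumption, not necessarily what the merged workflow does.

```python
from typing import List, NamedTuple, Optional


class TokenSpan(NamedTuple):
    """Hypothetical aligned token: frame span plus its mean probability."""
    token: str
    start_frame: int
    end_frame: int  # exclusive
    prob: float


def word_score(spans: List[TokenSpan]) -> Optional[float]:
    """Average token probability over a word, weighted by span length."""
    total_frames = sum(s.end_frame - s.start_frame for s in spans)
    if total_frames == 0:
        return None  # no frames to average over
    weighted = sum(s.prob * (s.end_frame - s.start_frame) for s in spans)
    return weighted / total_frames
```

A different aligner could fill the same `score` field with another quantity (for example a log-likelihood), which is why the field is optional and aligner-specific.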

pzelasko and others added 3 commits September 29, 2022 12:01
…ent-workflow' into feature/torchaudio-forced-alignment-workflow
3 participants