[workflow] Word-level forced alignment with pretrained models from Torchaudio #827
Super-fast development on this one! LGTM, in general.
Regarding alignments, I agree it's not very efficient. Perhaps treating them in the same way as Features would be a better option.
Here's one approach off the top of my head. We can have an
…ent-workflow' into feature/torchaudio-forced-alignment-workflow
lhotse/supervision.py
    """

    symbol: str
    start: Seconds
    duration: Seconds
    score: Optional[float] = None
What does the score indicate?
It's an aligner-specific "confidence score". For torchaudio, their tutorial suggests using the average token probability over a given segment, but it could also be computed differently. I'll add a proper note.
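For illustration, here is a minimal sketch of one way such a score could be filled in, assuming the aligner exposes a probability per token for each word segment. The helper `average_token_probability` and the `Seconds = float` alias are assumptions made for this example, not code from this PR.

```python
from typing import List, NamedTuple, Optional

# Assumption for this sketch: Seconds is a plain float alias.
Seconds = float


class AlignmentItem(NamedTuple):
    symbol: str
    start: Seconds
    duration: Seconds
    # Aligner-specific confidence; None when the aligner provides none.
    score: Optional[float] = None


def average_token_probability(token_probs: List[float]) -> Optional[float]:
    """Hypothetical helper: mean per-token probability for one segment."""
    if not token_probs:
        # No tokens means there is nothing to average, so leave the score unset.
        return None
    return sum(token_probs) / len(token_probs)


# One word segment whose four tokens have these (made-up) probabilities.
item = AlignmentItem(
    symbol="hello",
    start=0.5,
    duration=0.25,
    score=average_token_probability([0.75, 1.0, 0.5, 0.75]),
)
```

Other aligners could put a different quantity here (for example a log-likelihood), which is why the field stays optional and its meaning should be documented per aligner.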
Co-authored-by: Fangjun Kuang <[email protected]>
As a follow-up to #824, I'm adding another workflow for forced word-level alignment. It may be interesting to combine the two workflows. In the end I used the torchaudio model because the tutorial example looks quite convincing with regard to word timestamp accuracy, but I did not evaluate whether it's better or worse than any other approach. We can add other forced-alignment workflows later to offer some choice.
@desh2608 I also refactored the AlignmentItem thing a bit so that it's NamedTuple again, for the sake of efficiency. I don't like this approach though, after using it for some time. I think we should eventually change it to something like a numpy array (or a collection of them) and maybe not store them in the manifest. Anyway, at least the size of the manifests is greatly reduced with the NamedTuple approach.
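A rough illustration of the manifest-size point, assuming manifests are serialized as JSON and that the alternative to a NamedTuple is a dict per item with repeated field names. `TupleItem` and `DataclassItem` are hypothetical stand-ins, not the actual lhotse classes.

```python
import json
from dataclasses import asdict, dataclass
from typing import NamedTuple, Optional


class TupleItem(NamedTuple):
    symbol: str
    start: float
    duration: float
    score: Optional[float] = None


@dataclass
class DataclassItem:
    symbol: str
    start: float
    duration: float
    score: Optional[float] = None


# A NamedTuple is a tuple subclass, so json encodes it as a bare array.
tuple_json = json.dumps(TupleItem("hello", 0.5, 0.25, 0.75))

# A dataclass converted with asdict() repeats every field name per item.
dict_json = json.dumps(asdict(DataclassItem("hello", 0.5, 0.25, 0.75)))

# The array form can be read back positionally.
restored = TupleItem(*json.loads(tuple_json))

print(len(tuple_json), len(dict_json))
```

The saving per item is small, but it is multiplied by every word (or phone) in every supervision, which is where the reduction would come from under these assumptions.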