MMS forced alignment backend #1185
Force-pushed from e0fbc30 to 39f89c6.
@desh2608 @pzelasko seems to be ready for review. Caveats:
|
Thanks! I'll review this more carefully later. I think it's a good idea to separate out the parallelization logic to |
@pzelasko I can put my class for parallel processing in an external module. Please suggest where; I'll set aside time this weekend to open a new PR. As part of that PR, let's discuss my proposed changes and add this functionality to lhotse.
@pzelasko I did the refactor, please take a look. |
Force-pushed from 2676d4a to 8f86efe.
Also added the word tokenization libraries for Myanmar and Khmer. To the best of my knowledge, Lao is the only remaining major language that does not use spaces to divide text into words. I could not find a ready-to-use library for Lao word tokenization, even though local guys did some research on automating it. I guess it would be up to some Lao contributors to add a Python library when the need arises.
In addition to the comments below, it looks like this workflow depends on the package `language_data`; we should check somewhere whether it's importable and prompt the user to install it.
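A minimal sketch of such an import check, using only the standard library. The helper name `require_package` and the install hint are assumptions made for illustration, not part of this PR:

```python
import importlib.util


def require_package(name: str, install_hint: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing.

    `require_package` is a hypothetical helper, not an existing lhotse function.
    """
    # find_spec returns None for a top-level module that cannot be located.
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"This workflow requires the '{name}' package, which is not installed. "
            f"Install it with: {install_hint}"
        )


# Assumed usage at the start of the MMS workflow (the pip name is an assumption):
#     require_package("language_data", "pip install language_data")
```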
Also, I'm not sure the language auto-detection works correctly. I tried to align mini LibriSpeech as a test and got the following:
```
lhotse workflows align-with-torchaudio -n MMS_FA -d mps libri-train-5.jsonl.gz aligned-mms.jsonl.gz
Aligning: 0%| | 0/1519 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/bin/lhotse", line 33, in <module>
    sys.exit(load_entry_point('lhotse', 'console_scripts', 'lhotse')())
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/pzelasko/meaning/lhotse/lhotse/bin/modes/workflows.py", line 168, in align_with_torchaudio
    for cut in tqdm(
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/pzelasko/meaning/lhotse/lhotse/parallel.py", line 115, in __call__
    yield runner(item, **kwargs)
  File "/Users/pzelasko/meaning/lhotse/lhotse/workflows/forced_alignment/base.py", line 54, in __call__
    self.normalize_text(sup.text, language=sup.language)
  File "/Users/pzelasko/meaning/lhotse/lhotse/workflows/forced_alignment/mms_aligner.py", line 48, in normalize_text
    romanized_words = self._uroman(sep.join(orig_words), language=language).split(
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/uroman/__init__.py", line 14, in uroman
    language = Language.get(language).to_alpha3()
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/__init__.py", line 304, in get
    components = parse_tag(tag)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/tag_parser.py", line 212, in parse_tag
    subtag_error(subtags[0], 'a language code')
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/tag_parser.py", line 422, in subtag_error
    raise LanguageTagError(f"Expected {expected}, got {subtag!r}")
langcodes.tag_parser.LanguageTagError: Expected a language code, got 'english'
```
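The error suggests that the supervision's `language` field holds a full language name (`'english'`) where a language code is expected. One possible guard, sketched here with the standard library only, is to map known names to codes before the value reaches the romanizer. The function name and the tiny lookup table are hypothetical and deliberately incomplete; a real fix would more likely use a proper language-name lookup:

```python
from typing import Optional

# Hypothetical, intentionally tiny table for illustration only.
_NAME_TO_CODE = {"english": "en", "german": "de", "polish": "pl"}


def normalize_language(language: Optional[str]) -> Optional[str]:
    """Map a language name such as 'English' to a code; pass other values through."""
    if language is None:
        return None
    key = language.strip().lower()
    # Unknown values are returned lowercased and unchanged, on the assumption
    # that they are already language codes (e.g. 'en' or 'eng').
    return _NAME_TO_CODE.get(key, key)
```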
""" | ||
A class which uses ProcessPoolExecutor to parallelize the execution of a callable class. | ||
The instances of the runner class are instantiated separately in each worker process. | ||
""" |
Could you add an example of the usage in this doc? This API is not self-explanatory to me.
```python
class MMSForcedAligner(ForcedAligner):
    def __init__(self, bundle_name: str, device: str = "cpu"):
```
I'd remove `bundle_name` from the param list and hardcode `self.bundle_name = "MMS_FA"`; this will also allow removing the assertion below.
I'd consider an option `check_language: bool = True`, which would warn the users about a missing `language` field in the supervisions if detected (the message should mention how to disable the warnings as well).
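A rough sketch of how such a check could look, assuming the aligner can iterate over the `language` values of the supervisions. The function name `warn_if_language_missing` is hypothetical; only the `check_language` flag comes from the suggestion above:

```python
import logging
from typing import Iterable, Optional


def warn_if_language_missing(
    languages: Iterable[Optional[str]], check_language: bool = True
) -> int:
    """Warn once if any supervision lacks a language; return the number missing."""
    if not check_language:
        return 0
    # Treat both None and the empty string as "no language set".
    missing = sum(1 for lang in languages if not lang)
    if missing:
        logging.warning(
            "%d supervision(s) have no `language` field, so romanization may be "
            "inaccurate. Pass check_language=False to disable this warning.",
            missing,
        )
    return missing
```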
```
- https://pytorch.org/audio/stable/pipelines.html

:param cuts: input CutSet.
:param bundle_name: name of the selected pretrained model from torchaudio.
```
Can you add documentation about MMS here and in the CLI (`lhotse/bin/modes/workflows.py` under `align_with_torchaudio`)? Ideally a few words about how to enable it.
```python
bundle_name: str = "WAV2VEC2_ASR_BASE_960H",
device: str = "cpu",
normalize_text: bool = True,
num_jobs: int = 1,
```
This param needs to be exposed in the CLI (ideally also add a check that when num_jobs > 1, device == 'cpu')
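The suggested check could be as small as the following sketch. The function name `validate_parallel_config` is made up for illustration, and where the check would live (CLI or workflow function) is left open:

```python
def validate_parallel_config(num_jobs: int, device: str) -> None:
    """Reject combinations the workflow is not meant to support."""
    if num_jobs < 1:
        raise ValueError(f"num_jobs must be >= 1, got {num_jobs}")
    # Assumption taken from the review comment: multi-process runs are CPU-only.
    if num_jobs > 1 and device != "cpu":
        raise ValueError(
            f"num_jobs={num_jobs} requires device='cpu', got device='{device}'"
        )
```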
```python
    pre_alignment = self.align(audio, transcript)
except FailedToAlign:
    logging.info(
        f"Failed to align supervision '{sup.id}' for cut '{cut.id}'. Writing it without alignment."
```
It would be great to write out the original exception details here as well. For example, I tried to turn on MPS on macOS and got only a generic "failed to align", but it worked OK with CPU; I wouldn't have known why if I hadn't already suspected it.
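One way to surface the underlying error, sketched with stand-in names. The `FailedToAlign` class below is a local placeholder for the one in this PR, and `failure_details` and `align_or_skip` are hypothetical helpers. The sketch assumes the aligner raises `FailedToAlign` while handling (or explicitly chained from) the original exception:

```python
import logging
from typing import Any, Callable, Optional


class FailedToAlign(Exception):
    """Local placeholder for the exception used in the PR."""


def failure_details(exc: BaseException) -> str:
    """Describe an exception together with the chain of errors that led to it."""
    parts = []
    current: Optional[BaseException] = exc
    while current is not None:
        parts.append(f"{type(current).__name__}: {current}")
        # Prefer the explicit cause (`raise ... from ...`), else the implicit context.
        current = current.__cause__ or current.__context__
    return " <- ".join(parts)


def align_or_skip(
    align: Callable[[Any, str], Any],
    audio: Any,
    transcript: str,
    sup_id: str,
    cut_id: str,
) -> Optional[Any]:
    """Return the alignment, or None after logging why alignment failed."""
    try:
        return align(audio, transcript)
    except FailedToAlign as e:
        logging.info(
            "Failed to align supervision '%s' for cut '%s' (%s). "
            "Writing it without alignment.",
            sup_id,
            cut_id,
            failure_details(e),
        )
        return None
```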
Otherwise it seems to work well, great work @flyingleafe! |
@pzelasko I accounted for your comments, please check. |
LGTM!
Closes #1120
Note: the MMS forced aligner takes romanized text as input and uses `uroman` for romanization. The latter is written in Perl. I had to make a wrapper package for it to avoid the cringy direct downloading of the original Perl scripts somewhere. The wrapper package still calls `perl` in a subprocess though, which adds significant overhead. I wonder if porting `uroman` fully to Python is a worthwhile effort.

Note #2: I did not change the default `bundle_name` for backwards compatibility.

TODO list: