-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update mgb2.py #725
Update mgb2.py #725
Conversation
extra new line causes error with load_kaldi_data_dir()
Should |
Yes, I think so. The way it works now is that It assumes that each line in the wav will be split into list of two items [id, text]. The additional empty line adds third item and causes an error. One can also fix this issue from the |
Fixing unicode string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don’t we fix Kaldi utilities to skip empty lines instead?
@@ -130,7 +130,7 @@ def prepare_mgb2( | |||
) as f_out: | |||
for line in f_in: | |||
f_out.write(line.replace("wav/", f"{corpus_dir}/{part}/wav/")) | |||
f_out.write("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn’t the script write everything to a single line after this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lines read from the wav.scp file already contain a new line, so the additional "\n" adds empty line. However I agree that it will be cleaner to just filter out the empty lines from any files that is being read from kaldi folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Dan pointed out, Kaldi doesn’t allow empty lines, so let’s strive to produce correct outputs. Please fix style issues and LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The empty line was coming from the line f_out.write("\n")
which I removed now. I tested it and now it works perfectly with load_kaldi_data_dir().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Please fix black issues and LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Kaldi data dirs have a well defined format that does not allow empty lines. |
extra data cleaning
Thanks! |
extra new line causes error with load_kaldi_data_dir()