CSJ: Faithful Manifest #940
Conversation
Before merging, I was hoping to check if anyone knows: do the Mandarin recipes add spaces between each character, between each word, or not at all? I notice that some objects in Icefall, like
That MMI training method is not used very often now, so it's probably not a big deal.
Sure. Thanks for the assurance. I will keep the default text field of this recipe without spaces. The space-annotated text is still available under In that case, this PR is ready for merge from my side.
Thanks for the contribution. Since you're the original author of the recipe, I'll just merge it. In case you need
Great. But I plan to tidy up my branch for the Icefall recipe first, so that problems in this new PR can be identified before merging. I thought about modifying However, if you do find the customisable
Hi there, sorry for the long inactivity! This branch is ready for merge on my side now.
Thanks! LGTM.
How do I download the CSJ data?
Fixes
Changes
The main motivation of these changes is to convert the raw `.sdb` transcripts of CSJ into lhotse `SupervisionSegment` objects as faithfully as possible. Depending on their use case, users can further process the resulting `SupervisionSet` as lhotse objects, with minimal change to the lhotse CSJ recipe.
- `SupervisionSegment.custom["disfluent_tag"]` `.sdb` file
- One `SupervisionSegment` per utterance, without concatenation like in kaldi. Users can concatenate based on their preferred maximum utterance length or gap length.
  - `CutSet.merge_supervisions` concatenates text with whitespace. This could cause problems if you are using per-character tokenization.
- Made `transcript_dir` into an optional argument. By not providing that argument, users can choose not to use the predefined list of train-valid-test splits.
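
For readers who want to concatenate the per-utterance segments themselves, here is a minimal sketch of merging by gap length and maximum utterance length. It is not part of this PR and does not use the lhotse API: `Segment`, `merge_segments`, `max_gap`, `max_duration`, and `joiner` are hypothetical names chosen for illustration, and the thresholds are arbitrary. The default `joiner=""` avoids the inserted whitespace mentioned above, which may matter for per-character tokenization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a per-utterance supervision (not the lhotse class)."""

    start: float  # start time in seconds
    duration: float  # length in seconds
    text: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def merge_segments(
    segments: List[Segment],
    max_gap: float = 0.5,
    max_duration: float = 10.0,
    joiner: str = "",
) -> List[Segment]:
    """Greedily merge time-ordered segments.

    A segment is appended to the previous one when the silence between them
    is at most `max_gap` and the merged span stays within `max_duration`.
    Texts are joined with `joiner` (empty by default, so no whitespace is added).
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged:
            prev = merged[-1]
            gap = seg.start - prev.end
            new_duration = seg.end - prev.start
            if gap <= max_gap and new_duration <= max_duration:
                merged[-1] = Segment(
                    prev.start, new_duration, prev.text + joiner + seg.text
                )
                continue
        # Start a new merged segment (copy, so the input is not mutated).
        merged.append(Segment(seg.start, seg.duration, seg.text))
    return merged


# Illustrative data only, not taken from CSJ.
utterances = [
    Segment(0.0, 1.0, "今日は"),
    Segment(1.2, 1.0, "いい天気"),
    Segment(5.0, 1.0, "ですね"),
]
result = merge_segments(utterances, max_gap=0.5, max_duration=10.0)
for seg in result:
    print(seg.start, seg.duration, seg.text)
```

Passing `joiner=" "` instead reproduces whitespace-joined text, which may suit word-level tokenizers better.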