CSJ: Faithful Manifest #940

Merged: 8 commits, Feb 9, 2023

teowenshen
Contributor

Fixes

  • Last segment of each recording session was left out in the supervision

Changes
The main motivation of these changes is to convert the raw .sdb transcripts of CSJ into lhotse SupervisionSegment objects as faithfully as possible. Depending on their use case, users can further process the resulting SupervisionSet as lhotse objects, with minimal change to the lhotse CSJ recipe.

  • Removes config file and parses in disfluent mode by default
  • Adds per-character tag in SupervisionSegment.custom["disfluent_tag"]
  • Retains as much raw information as possible from the original transcript, including morphology information and kana pronunciation, so that parsing in other transcript modes can be done without referring back to the .sdb file
  • Produces 1 SupervisionSegment per utterance, without concatenating utterances as kaldi does. Users can concatenate based on their preferred maximum utterance length or gap length.
    • NOTE: lhotse's CutSet.merge_supervisions concatenates text with whitespace. This could cause problems if you are using per-character tokenization.
  • Includes a pre-defined validation set. Depending on your concatenation rules, the validation set is around 6 hours and 30 minutes.
  • Makes transcript_dir an optional argument. Users who omit it opt out of the predefined train-valid-test split.
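
The concatenation by maximum utterance length or gap length mentioned above could be sketched roughly as follows. This is a minimal illustration, not the recipe's code: `Segment` is a hypothetical stand-in for a lhotse SupervisionSegment, `concat_segments`, `max_gap` and `max_duration` are made-up names, and the sketch assumes all segments come from the same recording and speaker. Text is joined without a separator, which avoids the whitespace issue noted for per-character tokenization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for a lhotse SupervisionSegment (times in seconds)."""
    start: float
    duration: float
    text: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def concat_segments(
    segments: List[Segment], max_gap: float = 0.5, max_duration: float = 10.0
) -> List[Segment]:
    """Greedily merge time-adjacent segments.

    A segment joins the current group when the silence before it is at most
    `max_gap` and the merged span stays within `max_duration`.
    Text is joined with no separator.
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged:
            cur = merged[-1]
            gap = seg.start - cur.end
            new_duration = seg.end - cur.start
            if gap <= max_gap and new_duration <= max_duration:
                # Extend the current group to cover this segment as well.
                merged[-1] = Segment(cur.start, new_duration, cur.text + seg.text)
                continue
        # Start a new group.
        merged.append(Segment(seg.start, seg.duration, seg.text))
    return merged


utterances = [
    Segment(0.0, 1.0, "今日は"),
    Segment(1.2, 1.0, "晴れです"),
    Segment(5.0, 1.0, "次に"),
]
result = concat_segments(utterances, max_gap=0.5, max_duration=10.0)
```

Here the first two utterances are merged because the gap between them is short, while the third starts a new segment. With real lhotse objects, the other fields (speaker, gender, custom tags) would also need a merging rule.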

@teowenshen
Contributor Author

teowenshen commented Jan 11, 2023

Before merging, I was hoping to check if anyone knows: do the Mandarin recipes add spaces between each character, between each word, or not at all?

I notice that some objects in Icefall, like MmiTrainingGraphCompiler, assume that there are spaces between each token (character), but I remember the raw transcript has spaces in between words, not characters.

@danpovey
Collaborator

That MMI training method is not used very often now, so it's probably not a big deal.

@teowenshen
Contributor Author

Sure. Thanks for the assurance. I will keep the default text field of this recipe without spaces. The space-annotated text is still available under custom['raw'].

In that case, this PR is ready for merge from my side.

@pzelasko
Collaborator

pzelasko commented Jan 11, 2023

Thanks for the contribution. Since you're the original author of the recipe, I'll just merge it. In case you need CutSet.merge_supervisions() not to include the whitespace, feel free to make a PR that makes it optional (e.g. add an arg text_sep=' ' that can be set to an empty string '' instead).
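
To illustrate why the separator matters, here is a small sketch that is independent of lhotse. `join_texts` and `char_tokens` are hypothetical helpers, and `text_sep` only mirrors the argument name proposed above; it is not an existing lhotse parameter as far as this thread shows.

```python
from typing import List


def join_texts(texts: List[str], text_sep: str = " ") -> str:
    """Hypothetical helper: join supervision texts with a configurable separator."""
    return text_sep.join(texts)


def char_tokens(text: str) -> List[str]:
    """Per-character tokenization: every character, including a space, is a token."""
    return list(text)


parts = ["今日は", "晴れです"]
with_space = char_tokens(join_texts(parts))                  # default separator ' '
without_space = char_tokens(join_texts(parts, text_sep=""))  # empty separator
```

With the default separator, the token list contains one extra space token at the join point; with an empty separator it contains only the original characters.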

@teowenshen teowenshen changed the title CSJ: Faithful Manifest [WIP] CSJ: Faithful Manifest Jan 12, 2023
@teowenshen
Contributor Author

Great. But, I plan to tidy up my branch for the Icefall recipe first, so that problems in this new PR can be identified before merging.

I thought about modifying merge_supervisions(), but in my particular use case, the speaker, gender, segment id etc. of the individual cuts are guaranteed to be the same, so I don't need to concatenate those fields. Eventually the modification became somewhat involved, so I ended up adding a utility function concat_csj_supervisions(), which I expect users to import or refer to.

However, if you do find the customisable sep argument to be useful, I can add it in another PR.

@teowenshen
Contributor Author

Hi there, sorry for the long inactivity! This branch is ready for merge on my side now.

@pzelasko
Collaborator

pzelasko commented Feb 9, 2023

Thanks! LGTM.

@pzelasko pzelasko merged commit 3768d80 into lhotse-speech:master Feb 9, 2023
@pzelasko pzelasko added this to the v1.13 milestone Feb 9, 2023
@teowenshen teowenshen changed the title [WIP] CSJ: Faithful Manifest CSJ: Faithful Manifest Feb 9, 2023
@QDPeng

QDPeng commented Feb 17, 2023

How do I download the CSJ data?
I have registered for the free version and logged in on the website: https://clrd.ninjal.ac.jp/csj/en/index.html
But when I log in, I can't find a download button. It's a search page.
