CSJ: Faithful Manifest #940

teowenshen · 2023-01-11T05:58:37Z

Fixes

The last segment of each recording session was left out of the supervisions.

Changes
The main motivation of these changes is to convert the raw .sdb transcripts of CSJ into lhotse SupervisionSegment objects as faithfully as possible. Depending on their use case, users can further process the resulting SupervisionSet as lhotse objects, with minimal changes to the lhotse CSJ recipe.

- Removes the config file and parses in disfluent mode by default.
- Adds a per-character tag in SupervisionSegment.custom["disfluent_tag"].
- Retains as much raw information as possible from the original transcript, including morphology information and kana pronunciation, so that parsing in other transcript modes can be done without referring back to the .sdb file.
- Produces one SupervisionSegment per utterance, without concatenation as in Kaldi. Users can concatenate based on their preferred maximum utterance length or gap length.
  - NOTE: lhotse's CutSet.merge_supervisions concatenates text with whitespace. This could cause problems if you are using per-character tokenization.
- Includes a pre-defined validation set. Depending on your concatenation rules, the validation set is around 6 hours and 30 minutes.
- Changes transcript_dir into an optional argument. By omitting that argument, users can choose not to use the predefined train-valid-test split.
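
To make the concatenation rule above concrete, here is a minimal sketch of merging per-utterance segments by maximum gap and maximum total duration. It is an illustration only: `Segment` and `concat_segments` are hypothetical names and plain Python, not lhotse's SupervisionSegment and not the utility added in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical stand-in for one utterance (not lhotse's class)."""
    start: float     # seconds from the start of the recording
    duration: float  # seconds
    text: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def concat_segments(
    segments: List[Segment],
    max_gap: float,
    max_duration: float,
    sep: str = "",
) -> List[Segment]:
    """Greedily merge neighbouring segments.

    A segment is appended to the previous merged segment when the silence
    between them is at most `max_gap` and the merged span stays within
    `max_duration`. Texts are joined with `sep` (empty by default, which
    suits per-character tokenization).
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged:
            prev = merged[-1]
            gap = seg.start - prev.end
            new_duration = seg.end - prev.start
            if gap <= max_gap and new_duration <= max_duration:
                merged[-1] = Segment(prev.start, new_duration, prev.text + sep + seg.text)
                continue
        merged.append(Segment(seg.start, seg.duration, seg.text))
    return merged
```

With `max_gap=0.5`, two utterances separated by 0.2 s of silence would be merged, while one that starts several seconds later would stay separate.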

teowenshen · 2023-01-11T06:37:24Z

Before merging, I was hoping to check if anyone knows: do the Mandarin recipes add spaces between each character, between each word, or not at all?

I notice that some objects in Icefall, like MmiTrainingGraphCompiler, assume that there are spaces between each token (character), but I remember the raw transcript has spaces in between words, not characters.
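
For reference, if a downstream component does expect one space between tokens (characters), the conversion from unspaced or word-spaced text is small. The helper below is a hypothetical sketch in plain Python, not something taken from Icefall or lhotse.

```python
def space_separate_chars(text: str) -> str:
    """Return `text` with exactly one space between characters.

    Existing whitespace (including word-level spaces) is dropped first,
    so word-spaced and unspaced transcripts give the same result.
    """
    return " ".join(ch for ch in text if not ch.isspace())
```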

danpovey · 2023-01-11T07:02:27Z

That MMI training method is not very often used now, it's probably not a big deal.

teowenshen · 2023-01-11T09:08:56Z

Sure. Thanks for the assurance. I will keep the default text field of this recipe without spaces. The space-annotated text is still available under custom['raw'].

In that case, this PR is ready for merge from my side.

pzelasko · 2023-01-11T19:35:11Z

Thanks for the contribution. Since you're the original author of the recipe, I'll just merge it. In case you need CutSet.merge_supervisions() not to include the whitespace, feel free to make a PR that makes it optional (e.g. add an arg text_sep=' ' that can be set to an empty string '' instead).
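
A rough sketch of what such an option might look like, and why it matters for per-character tokenization. `join_texts` is a hypothetical standalone function; the parameter name `text_sep` is taken from the suggestion above, and whether lhotse exposes it should be checked against the lhotse documentation.

```python
from typing import Iterable


def join_texts(texts: Iterable[str], text_sep: str = " ") -> str:
    """Join the non-empty supervision texts with a configurable separator."""
    return text_sep.join(t for t in texts if t)


texts = ["今日は", "晴れ"]

# Default separator: the joined text contains a space, which a naive
# per-character tokenizer would treat as a token of its own.
with_space = list(join_texts(texts))

# Empty separator: only the original characters remain.
without_space = list(join_texts(texts, text_sep=""))
```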

teowenshen · 2023-01-12T07:34:03Z

Great. But, I plan to tidy up my branch for the Icefall recipe first, so that problems in this new PR can be identified before merging.

I thought about modifying merge_supervisions(), but in my particular use case, the speaker, gender, segment id etc. of the individual cuts are guaranteed to be the same, so I don't need to concatenate those fields. Eventually the modification became somewhat involved, so I ended up adding a utility function concat_csj_supervisions(), which I expect users to import or refer to.

However, if you do find the customisable sep argument to be useful, I can add it in another PR.

teowenshen · 2023-02-09T07:41:23Z

Hi there, sorry for the long inactivity! This branch is ready for merge on my side now.

pzelasko · 2023-02-09T13:41:52Z

Thanks! LGTM.

QDPeng · 2023-02-17T02:55:24Z

How do I download the CSJ data?
I have logged in and registered for the free version on the website: https://clrd.ninjal.ac.jp/csj/en/index.html
But when I log in, I can't find the download button. It's a search page.

teowenshen added 3 commits January 10, 2023 15:51

Parse sdb faithfully

e9135f8

add no split option

243723f

fix doc

a6981d9

teowenshen added 3 commits January 11, 2023 22:14

add concat supervision utility

ab4eeb1

CSJSDBParser defaults to no separator

82e0f59

fix unsync-ed tag and text in parser

25f5472

teowenshen changed the title ~~CSJ: Faithful Manifest~~ [WIP] CSJ: Faithful Manifest Jan 12, 2023

Auto disfluent parsing at first run

d9bdc88

Merge branch 'master' into faithful_sp

549610a

pzelasko merged commit 3768d80 into lhotse-speech:master Feb 9, 2023

pzelasko added this to the v1.13 milestone Feb 9, 2023

teowenshen changed the title ~~[WIP] CSJ: Faithful Manifest~~ CSJ: Faithful Manifest Feb 9, 2023
