Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix load_kaldi_dara_dir not loading segments and feats.scp correctly #1005

Merged
merged 3 commits into from
Mar 22, 2023

Conversation

yasumori
Copy link

@yasumori yasumori commented Mar 21, 2023

Fix #987

  • I'd change the script to store mat_shape in utt_id_to_start_and_duration if calling kaldi_native_io.MatrixShape.read(ark) twice is an overhead.
  • Even nicer is probably to consolidate information from kaldi data files (segments, utt2spk, feats.scp) in a sort of data structure and then create Recordings, Supervisions and Features but I didn't have time to do this.
  • The reason why Supervisions need to inherit duration computed from a feature matrix is that the validation in https://github.com/lhotse-speech/lhotse/blob/master/lhotse/qa.py#LL298C3-L298C3 would fail if duration is inherited from segments.

@pzelasko
Copy link
Collaborator

Thanks! I think it looks good, but would be great if we could get a second pair of eyes that's more familiar with Kaldi compatibility code CC @desh2608 @jtrmal could one of you also check it?

@jtrmal
Copy link
Collaborator

jtrmal commented Mar 22, 2023 via email

@desh2608
Copy link
Collaborator

I think this is fine as a temporary solution, but as you mentioned, the kaldi.py script is getting a little cluttered, and too many if-else inside the functions. There are some issues we may run into in the future, such as: if the feature duration is very different from the segment duration (e.g. more than "off by 1") we may want to discard such segments, or trim either the segment or the feature, or whatever. Otherwise it could cause mismatch errors when trying to use the corresponding cuts. Such issues can be elegantly handled by creating an intermediate data structure as you mentioned, and providing some sensible defaults.

@desh2608
Copy link
Collaborator

Please fix the style issues.

Due to hard coded start time in load_kaldi_data_dir, cut sets created
from segments and feats.scp can have empty supervisions.

To fix this issue, resolve start time and duration of features when both
segments and feats.scp are available.

Supervisions inherit duration computed from feats.scp and Features
inherit start time from segments.
@jtrmal
Copy link
Collaborator

jtrmal commented Mar 22, 2023

looks ok but it would be add test case for the functionality

@pzelasko
Copy link
Collaborator

@yasumori could you add a unit test as well? We have some tests for Kaldi export/import so hopefully it wouldn't be excessively difficult.

@yasumori
Copy link
Author

@pzelasko Yes, I'll push a new commit adding a test soon

* test/fixtures/mini_librispeech2 is a subset of 3 utterances from
test/fixtures/mini_librispeech and contains mfcc feature arks for
testing.

* Start time of lbi-3536-23268-0000 is deliberately changed to 1.0 in
segments to test whether a Feature object inherits start time from
segments when feats.scp and segments are both available using
load_kaldi_data_dir.
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@pzelasko pzelasko merged commit 606bbb1 into lhotse-speech:master Mar 22, 2023
@pzelasko pzelasko added this to the v1.13 milestone Mar 22, 2023
@jtrmal
Copy link
Collaborator

jtrmal commented Mar 23, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load_kaldi_data_dir with segments
4 participants