Fix to long recording segmentation #1639

vimalmanohar · 2017-05-23T04:14:05Z

This PR supports a more generic setup in which existing utterances are segmented, and fixes some bugs.

danpovey · 2017-05-23T04:27:19Z

egs/wsj/s5/steps/cleanup/internal/align_ctm_ref.py

                                          symbol_table=symbol_table)
                for line in ctm_edits:
-                    ctm_line = list(reco2file_and_channel[reco])
+                    ctm_line = [reco, reco2file_and_channel[reco]]


are you sure this is right? looks like reco2file_and_channel[reco] is a tuple.
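
A minimal sketch of the concern, assuming `reco2file_and_channel` maps a recording id to a `(file, channel)` tuple as the comment suggests. The ids and values here are made up for illustration; only the two assignment expressions come from the diff.

```python
# Hypothetical contents; the real mapping is built elsewhere in the script.
reco2file_and_channel = {"reco1": ("file1", "A")}
reco = "reco1"

# Removed line: the tuple is expanded into separate list elements.
old_ctm_line = list(reco2file_and_channel[reco])

# Added line: the tuple stays nested as a single second element.
new_ctm_line = [reco, reco2file_and_channel[reco]]

print(old_ctm_line)  # flat list of the tuple's fields
print(new_ctm_line)  # recording id followed by one nested tuple
```

If the value really is a tuple, any later code that joins the fields of `ctm_line` as strings would see a nested tuple rather than two separate fields, which appears to be what the question is getting at.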

danpovey · 2017-05-23T04:27:54Z

egs/wsj/s5/steps/cleanup/internal/ctm_to_text.pl

@@ -0,0 +1,52 @@
+#! /usr/bin/perl
+
+# Copyright 2016  Vimal Manohar


needs documentation.

danpovey · 2017-05-23T04:29:53Z

egs/wsj/s5/steps/cleanup/segment_long_utterances.sh

+Usage: $0 [options] <model-dir> <lang> <data-in> [<text-in> <utt2text>] <segmented-data-out> <work-dir>
 e.g.: $0 exp/wsj_tri2b data/lang_nosp data/train_long data/train_long/text data/train_reseg exp/segment_wsj_long_utts_train
 This script performs segmentation of the data in <data-in> and 
 transcript <text-in>, writing the segmented data (with a segments file) to


this documentation does not read well, or does not agree with the usage line. Please check carefully.

danpovey · 2017-05-23T04:31:08Z

egs/wsj/s5/steps/cleanup/segment_long_utterances.sh

+Note: If <utt2text> is not provided, the text in <data-in> is used as the 
+original transcript.
+If <utt2text>, if provided, is a mapping from the utterance in <data-in> to the 
+<original-transript> in <text-in>.


you use the token original-transcript without defining what it is. I assume you refer to the key that is
the first field of text-in. Maybe "transcript-key" would be clearer, as it clarifies that it is a key to a
transcript and not an actual transcript.

danpovey · 2017-05-23T04:35:01Z

egs/wsj/s5/utils/data/fix_subsegment_feats.pl

+# original directory which will be segmented using 
+# utils/data/subsegment_data_dir.sh.
+
+(scalar @ARGV == 1) or die "Usage: fix_subsegment_feats.pl <utt2max_frames>";


max_frames should be max-frames.

danpovey · 2017-05-23T04:36:06Z

egs/wsj/s5/utils/data/fix_subsegment_feats.pl

+use warnings;
+
+# This script modifies the feats ranges and ensures that they don't 
+# exceed the max number of frames supplied in utt2max_frames.


this documentation is not really sufficient. a "feats range" is not something most readers will be familiar with, so you need to give an example. you also don't state that this script reads from stdin, and you don't give any example of what utt2max-frames or the stdin might look like, or of how this script changes the stdin before printing it out.
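
To illustrate the kind of example being asked for, here is a rough sketch in Python (not the Perl script under review) of clamping a row range at the end of a feats.scp-style line. It assumes a range is written as a trailing `[start:end]` with zero-based, inclusive indices, and that the max-frames map is keyed by the same utterance id as the first field; the function name, the sample line, and both of those assumptions are illustrative and not taken from the PR.

```python
import re

def clamp_feats_range(line, utt2max_frames):
    """Clamp the end of a trailing [start:end] range to max_frames - 1.

    Assumes zero-based, inclusive row indices (an assumption of this sketch).
    Lines without a range, or with an unknown utterance id, are returned as-is.
    """
    line = line.rstrip("\n")
    parts = line.split(None, 1)
    if len(parts) < 2:
        return line
    utt, rest = parts
    match = re.search(r"\[(\d+):(\d+)\]\s*$", rest)
    if match is None or utt not in utt2max_frames:
        return line
    start = int(match.group(1))
    end = min(int(match.group(2)), utt2max_frames[utt] - 1)
    return f"{utt} {rest[:match.start()]}[{start}:{end}]"

# Hypothetical input: the range runs past the 250 frames recorded for this id.
example = "utt1-0001 feats.ark:1234[100:260]"
print(clamp_feats_range(example, {"utt1-0001": 250}))
```

A before/after pair like the one above, placed in the script's header comment, is the sort of thing that would make the term "feats range" concrete.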

danpovey · 2017-05-23T04:38:05Z

egs/wsj/s5/utils/data/subsegment_data_dir.sh

  echo "$0: note: frame shift is $frame_shift [affects feats.scp]"
-
+
+  utils/data/get_utt2num_frames.sh --cmd "run.pl" --nj 1 $srcdir


please document these few lines of code.

danpovey · 2017-05-23T04:38:52Z

egs/wsj/s5/steps/cleanup/segment_long_utterances.sh

-  done > $dir/docs/source2tf_idf.scp
-
-
+  done | perl -ane 'BEGIN{ %tfidfs = (); } { if (!defined $tfidfs{$F[0]}) { $tfidfs{$F[0]} = $F[1]; } } END{while(my ($k, $v) = each %tfidfs) { print "$k $v\n"; }}' > $dir/docs/source2tf_idf.scp


break up this line-- doesn't have to be 80 chars but don't make it over 100 or so.
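
For readers unpacking what the one-liner does, the same logic can be written out over several lines. This is a sketch in Python rather than the Perl used in the script: it keeps the first value seen for each key (first field) and drops later duplicates. The function name is hypothetical. Two differences from the Perl are worth noting: Python dicts preserve insertion order whereas Perl's `each` iterates in no guaranteed order, and this sketch skips lines with fewer than two fields.

```python
def keep_first_per_key(lines):
    """Return 'key value' strings, keeping only the first value per key."""
    first_seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # sketch-only choice: ignore malformed lines
        # setdefault stores the value only if the key is not present yet
        first_seen.setdefault(fields[0], fields[1])
    return [f"{key} {value}" for key, value in first_seen.items()]

# Hypothetical entries; the second "docA" line is dropped.
sample = ["docA tf_idf/a.1.txt", "docB tf_idf/b.txt", "docA tf_idf/a.2.txt"]
for entry in keep_first_per_key(sample):
    print(entry)
```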

vimalmanohar added 6 commits May 18, 2017 13:58

long_utts: Minor fix

58fcd61

Merge branch 'master' of github.com:kaldi-asr/kaldi

df90bba

Merge branch 'master' of github.com:kaldi-asr/kaldi

ff4ac04

Merge branch 'master' of github.com:kaldi-asr/kaldi

29ced2a

Merge branch 'master' of github.com:kaldi-asr/kaldi

0f69bbd

long_utts: Fixing bugs and making scripts more general

257c54f

danpovey reviewed May 23, 2017

long_utts: Fixing based on comments

bda1c78

vimalmanohar mentioned this pull request May 23, 2017

segment_long_utterances.sh failing on decode_segmentation #1629

Open

danpovey merged commit d6cf1bd into kaldi-asr:master May 23, 2017

Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018

[scripts,egs] Fixes to long-recording segmentation (kaldi-asr#1639)

5576645
