@vimalmanohar
Contributor

This PR supports a more generic setup in which existing utterances are segmented, and it fixes some bugs.

symbol_table=symbol_table)
for line in ctm_edits:
ctm_line = list(reco2file_and_channel[reco])
ctm_line = [reco, reco2file_and_channel[reco]]

Are you sure this is right? It looks like reco2file_and_channel[reco] is a tuple.
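
To make the concern concrete, here is a small Python sketch of how the two variants in the diff behave. It assumes, as the comment suggests, that reco2file_and_channel maps a recording ID to a (file, channel) tuple. The IDs and the third variant are made up for illustration and are not taken from the PR.

```python
# Hypothetical data: assumes each recording maps to a (file, channel) tuple.
reco2file_and_channel = {"reco1": ("file1", "A")}
reco = "reco1"

# First variant from the diff: unpacks the tuple into a flat list,
# but drops the recording ID.
flat = list(reco2file_and_channel[reco])

# Second variant from the diff: a two-element list whose second
# element is still the whole tuple.
nested = [reco, reco2file_and_channel[reco]]

# One possible flat form that keeps the recording ID (an assumption,
# not necessarily what the PR intends).
flat_with_reco = [reco] + list(reco2file_and_channel[reco])

print(flat)
print(nested)
print(flat_with_reco)
```

If later code appends string fields to the line and joins them, the nested form would carry a tuple where a string is expected, which is presumably what the question is getting at.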

@@ -0,0 +1,52 @@
#! /usr/bin/perl

# Copyright 2016 Vimal Manohar

needs documentation.

Usage: $0 [options] <model-dir> <lang> <data-in> [<text-in> <utt2text>] <segmented-data-out> <work-dir>
e.g.: $0 exp/wsj_tri2b data/lang_nosp data/train_long data/train_long/text data/train_reseg exp/segment_wsj_long_utts_train
This script performs segmentation of the data in <data-in> and
transcript <text-in>, writing the segmented data (with a segments file) to

this documentation does not read well, or does not agree with the usage line. Please check carefully.

Note: If <utt2text> is not provided, the text in <data-in> is used as the
original transcript.
If <utt2text>, if provided, is a mapping from the utterance in <data-in> to the
<original-transript> in <text-in>.

you use the token original-transcript without defining what it is. I assume you refer to the key that is
the first field of text-in. Maybe "transcript-key" would be clearer, as it clarifies that it is a key to a
transcript and not an actual transcript.

# original directory which will be segmented using
# utils/data/subsegment_data_dir.sh.

(scalar @ARGV == 1) or die "Usage: fix_subsegment_feats.pl <utt2max_frames>";

max_frames should be max-frames.

use warnings;

# This script modifies the feats ranges and ensures that they don't
# exceed the max number of frames supplied in utt2max_frames.

This documentation is not really sufficient. A "feats range" is not something most readers will be familiar with, so you need to give an example. You also don't state that this script reads from stdin, give any example of what utt2max-frames or the stdin might look like, or explain how this script changes the stdin before printing it out.
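
As an illustration of the kind of example being asked for, here is a rough Python sketch of clamping a feats range. It is not the Perl script from the PR. It assumes that each input line looks like `<utt-id> <rxfilename>[<start>:<end>]` with an inclusive end frame, and that utt2max_frames maps an utterance ID to a frame count. The function name, regular expression, and sample values are all hypothetical.

```python
import re

# Assumed line shape: "<utt-id> <rxfilename>[<start>:<end>]", end inclusive.
RANGE_RE = re.compile(r"^(\S+) (.+)\[(\d+):(\d+)\]$")


def clamp_feats_line(line, utt2max_frames):
    """Return the line with its end frame clamped to max_frames - 1."""
    line = line.strip()
    m = RANGE_RE.match(line)
    if m is None:
        # No range on this line; pass it through unchanged.
        return line
    utt, rxfilename = m.group(1), m.group(2)
    start, end = int(m.group(3)), int(m.group(4))
    max_frames = utt2max_frames.get(utt)
    if max_frames is not None and end >= max_frames:
        end = max_frames - 1
    return "{} {}[{}:{}]".format(utt, rxfilename, start, end)


# Hypothetical input: the range ends at frame 120, but only 100 frames exist.
utt2max_frames = {"utt1-seg1": 100}
print(clamp_feats_line("utt1-seg1 feats.ark:42[50:120]", utt2max_frames))
```

Under these assumptions, a range that runs past the available frames is shortened, and lines already within bounds are printed unchanged.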

echo "$0: note: frame shift is $frame_shift [affects feats.scp]"


utils/data/get_utt2num_frames.sh --cmd "run.pl" --nj 1 $srcdir

please document these few lines of code.

done > $dir/docs/source2tf_idf.scp


done | perl -ane 'BEGIN{ %tfidfs = (); } { if (!defined $tfidfs{$F[0]}) { $tfidfs{$F[0]} = $F[1]; } } END{while(my ($k, $v) = each %tfidfs) { print "$k $v\n"; }}' > $dir/docs/source2tf_idf.scp

Break up this line. It doesn't have to be 80 chars, but don't make it over 100 or so.
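
For readers unpicking the one-liner: it keeps the first value seen for each key (the first whitespace-separated field) and prints one `key value` pair per key. A rough Python equivalent, spread over several lines, might look like the sketch below. The function name is hypothetical, and skipping lines with fewer than two fields is a simplification that the Perl version does not make.

```python
def keep_first_value_per_key(lines):
    """Keep only the first 'key value' pair seen for each key."""
    first_seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Simplification: ignore lines without both a key and a value.
            continue
        key, value = fields[0], fields[1]
        if key not in first_seen:
            first_seen[key] = value
    return ["{} {}".format(k, v) for k, v in first_seen.items()]


# Made-up input; the second "doc1" entry is dropped.
sample = ["doc1 a.tfidf", "doc2 b.tfidf", "doc1 c.tfidf"]
for out_line in keep_first_value_per_key(sample):
    print(out_line)
```

One difference worth noting: Perl's `each` returns hash entries in no particular order, whereas a Python dict (3.7 and later) preserves insertion order, so the output order of the two versions can differ.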
