HUB4 English Broadcast News recipe #2027
Conversation
I'll change the name to hub4_english
}
next;
}
remove empty line
else {&pusho($back);}
}
next;
remove empty line
Guys, let's hold off on those csr_hub4_utils
<#2027 (comment)> files
-- I think Vimal included them from the corpus, so I'm not completely
convinced we can include them (licensing and IP issues).
If there were modifications, then I think we should just publish a patch.
@danpovey?
#local/data_prep/parse_sgm_1995_csr_hub4.pl $dir/dev95_text.list > $dir/dev95_transcripts.txt 2> $dir/parse_sgml_dev95.log || exit 1
#local/data_prep/parse_sgm_1995_csr_hub4.pl $dir/eval95_test.list > $dir/eval95_transcripts.txt 2> $dir/parse_sgml_eval95.log || exit 1
#
#exit 0
maybe remove those commented lines or explain what's the purpose of them?
is there a reason this is still WIP?

This is ready for commit.
egs/hub4_english/s5/run.sh
Outdated
# audio recordings of HUB4 with raw unaligned transcripts into short segments
# with aligned transcripts for training new ASR models.

# First run the data preparation stages in WSJ run.sh
The script from here seems like it would be better as part of local/run_segmentation_wsj.sh. E.g. wsj_base could be made an argument to that script. And all this could be commented out. Unless I misunderstand something. I assume this isn't actually needed by the rest of this script, or it would be at the top...?
- # This is similar to get_ctm.sh, but gets the
- # CTM at the utterance-level.
+ # This is similar to get_ctm.sh, but gets the CTM at the utterance-level.
This script seems like it's not too closely related to the cleanup scripts and could be moved to steps/get_ctm_fast.sh. Clarify that it differs from get_ctm.sh because it only uses a single lmwt, operates in parallel over multiple lattices, outputs the ctm to a separate directory, and doesn't have the option to convert the ctm from an utterance-level ctm to a per-recording ctm.
You should make sure there is a suitable "see also" in get_ctm.sh.
@@ -0,0 +1,70 @@
+#!/usr/bin/env python
For new python scripts we prefer python3. (and you can remove compatibility things). This will mean it's not necessary to verify separately that it's python3 compatible.
As discussed, we'll remove the text-manipulating aspects of this script and replace them with a separate call to apply_map.pl.
Unfortunately, the way the text formats for Kaldi are designed doesn't mesh well with python3's default 'str' type. The "text" file format is encoding-agnostic: any encoding is allowed, as long as (when interpreted as a bytestream) whitespace characters represent space and newline represents newline. And because the encoding is never needed by Kaldi itself, we never have to specify it.
It might actually be possible to deal with this in python3 using its 'byte-string' type. From a search it looks like it is possible to do things like splitting byte-strings. This might be a good test case for figuring out the right way to do this in python, because as python3 becomes more and more the default, we need to make sure it doesn't break things.
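The byte-string idea above can be sketched as follows (the helper name and the sample data are illustrative only, not part of Kaldi):

```python
# Sketch of encoding-agnostic handling of a Kaldi 'text' line in Python 3
# using byte-strings. Opening the file in binary mode ('rb') sidesteps the
# UnicodeDecodeError that text-mode reading raises under an ASCII locale.

def parse_kaldi_text_line(line):
    """Split one line of a Kaldi 'text' file without assuming an encoding.

    bytes.split() with no argument splits on ASCII whitespace, which is
    the only structure the format guarantees; tokens stay as raw bytes.
    """
    parts = line.split()
    return parts[0], parts[1:]

# A line containing a non-UTF-8 byte (0xe9, Latin-1 'e-acute') that would
# fail to decode if the file were read in text mode with LC_ALL=C.
raw_line = b"utt1 caf\xe9 bonjour\n"
utt_id, words = parse_kaldi_text_line(raw_line)
print(utt_id)   # b'utt1'
print(words)    # [b'caf\xe9', b'bonjour']
```

In a real script the line would come from `open(path, 'rb')`; the point is just that splitting and joining work on `bytes` without ever choosing an encoding.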
I get encoding issues in python3, which I don't get in python2. There is
some text with non-ascii characters, but since LC_ALL=C is used in general,
the file is said to be in ascii. Python3 tries to then read the file in
ascii and fails.
Vimal
OK, well we 100% need to be compatible with python3. The way it's done in
python3 may be different but you'll have to ask around; I don't know.
Dan
# audio recordings of HUB4 with raw unaligned transcripts into short segments
# with aligned transcripts for training new ASR models.

# local/run_segmentation_wsj.sh
You have set this up like a "tuning" directory, with _a.sh and _b.sh and a soft link. But after looking at the scripts, I don't think this setup makes sense because they are for two different scenarios. Perhaps you could call them e.g.
local/run_segmentation_wsj_unsup.sh and local/run_segmentation_wsj_sup.sh,
have commented-out invocations to both of them in the run.sh, and explain in the run.sh and in the respective scripts how they differ.
This is ready to be merged.
danpovey left a comment
We're making progress. Some more smallish comments.
egs/wsj/s5/steps/dict/train_g2p.sh
Outdated
if $only_words && [ ! -z "$silence_phones" ]; then
  awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i;a[$1]=s;if(!(s in a)) print $1" "s}' \
-   $silence_phones > $wdir/lexicon_onlywords.txt
+   $lexicon $silence_phones > $wdir/lexicon_onlywords.txt
I know this awk script was pre-existing, but it's rather ugly. Can you rewrite it to make it clearer? I prefer perl or awk to avoid a super-long python script. Also the script seems to remove words where the pronunciation was seen before, which I doubt was the intention. And it's not clear to me that the order of 'lexicon' vs. 'silence_phones' as you have it, is correct anyway. Bear in mind that it's possible that there could be multiple silence phones on a single line, in the $silence_phones file, which I think this script does not handle properly.
You can use the following as inspiration (reading from a file directly is less ugly than the FNR trick):
# Get training data with OOV words (w.r.t. our current vocab) replaced with <unk>,
# as well as adding </s> symbols at the end of each sentence
cat $train_text | awk -v w=$dir/wordlist.all \
'BEGIN{while((getline<w)>0) v[$1]=1;}
{for (i=2;i<=NF;i++) if ($i in v) printf $i" ";else printf "<unk> ";print ""}' | sed 's=$= </s>=g' \
| utils/shuffle_list.pl | gzip -c > $dir/all.gz
@xiaohui-zhang would have an idea about this part.
@danpovey do we allow multiple silence phones to occur in the same row in data/local/dict/silence_phones.txt ? I thought they have to be put in different columns.
#! /bin/bash

- # Copyright 2016 Vimal Manohar
+ # Copyright 2016-18 Vimal Manohar
prefer -2018
local/lm/merge_word_counts.py 1 | sort -k 1,1nr > $dir/data/work/final.wordlist_counts

if [ ! -z "$vocab_size" ]; then
  awk -v sz=$vocab_size 'BEGIN{count=-1;}
traditionally you wouldn't use up this much vertical space in an inline awk script.
) || exit 1;

num_dev_sentences=4500
RANDOM=0
remove unused variable
@@ -0,0 +1 @@
+ tuning/run_segmentation_wsj_a.sh
(No newline at end of file)
this points to the "a" script. Check that this is what you intended.
# For nnet3 and chain results after cleanup, see the scripts in
# local/nnet3/run_tdnn.sh and local/chain/run_tdnn.sh

# GMM Results for speaker-independent (SI) and speaker adaptive training (SAT) systems on dev and test sets
Did you intend to add results here? Add them or remove this comment.
export PATH=$PATH:`pwd`/local/dict

if [ $stage -le 3 ]; then
  cat $wordlist | python -c '
I think you could replace this inline python script with:
utils/filter_scp.pl --exclude $wordlist <$orig_wordlist > $dir/oovlist
# Remove ; and , from words, if they are present; these
# might crash our scripts, as they are used as separators there.
filter_dict.pl $dir/dict.cmu > $dir/f/dict
You seem to have inherited some old WSJ scripts for extending a dictionary. I believe this works worse than g2p.py, when we tested. Is this really needed?
for x in $dir/score_*/$name.ctm; do
  cp $x $dir/tmpf;
  cat $dir/tmpf | grep -i -v -E '<NOISE|SPOKEN_NOISE>' | \
    grep -i -v -E ' (UH|UM|EH|MM|HM|AH|HUH|HA|ER|OOF|HEE|ACH|EEE|EW)$' | \
Do these things really appear in the BN setup?
I think some of them can appear. There are many test sets. Some can include conversational speech.
mkdir -p tools
pip install -t tools/beautifulsoup4 beautifulsoup4
fi
export PYTHONPATH=$PWD/tools/beautifulsoup4:$PYTHONPATH
please add a comment explaining why this is needed.
Regarding multiple silence phones in the same row of dict/silence_phones.txt:
I think they are allowed but it may not be the normal case (it would mean
you'd share the root).
understand. thanks
egs/wsj/s5/steps/dict/train_g2p.sh
Outdated
if $only_words && [ ! -z "$silence_phones" ]; then
- awk 'NR==FNR{a[$1] = 1; next} {s=$2;for(i=3;i<=NF;i++) s=s" "$i;a[$1]=s;if(!(s in a)) print $1" "s}' \
-   $silence_phones > $wdir/lexicon_onlywords.txt
+ awk -v w=$silence_phones \
I fixed this. Let me know if this is fine. @xiaohui-zhang @danpovey
egs/wsj/s5/steps/dict/train_g2p.sh
Outdated
awk -v w=$silence_phones \
  'BEGIN{while((getline<w)>0) {for(i=1;i<=NF;i++) sil[$i]=1;}}
   { p=$2; for(i=3;i<=NF;i++) p=p" "$i;
     if(!(p in sil)) print $1" "p }' $lexicon > $wdir/lexicon_onlywords.txt
couldn't this be simplified to: { if (!(NF == 2 && $2 in sil)) print; }
and I'd prefer to rename w to s.
agreed. otherwise looks good. Thanks Vimal!
I fixed the issues, and it is ready to be committed.

Thanks! Merging.
Basic GMM recipe for English broadcast news. Needs to be tested.
@xiaohui-zhang Please review it.