multi_en: Fixed acronym normalization, swbd lexicon preparation, OOV … #2137

xiaohui-zhang · 2018-01-09T18:08:22Z

…pronunciation generation, acoustic data sub-sampling, etc.; Added hub4_97 data. @vimalmanohar, can you please review it? Thanks!

vimalmanohar · 2018-01-23T18:26:45Z

egs/multi_en/s5/run.sh

+       --cmd "$train_cmd" --nj 40 \
+       $data exp/make_mfcc/$c/train || exit 1;
+     steps/compute_cmvn_stats.sh \
+       $data exp/make_mfcc/$c/train || exit 1;


Exiting here is not very useful. Better to do || touch $dir/.error and check for that later.

vimalmanohar · 2018-01-23T18:28:51Z

egs/multi_en/s5/run.sh

-  # We prepare the dictionary in data/local/dict_combined.
-  local/prepare_dict.sh $swbd $tedlium2
-  local/g2p/train_g2p.sh --stage 0 --silence-phones "data/local/dict_combined/silence_phones.txt" data/local/dict_combined exp/g2p
+  wait


Comment that you are waiting for train_g2p.sh to finish.

Also you should check for errors if it has failed.

xiaohui-zhang · 2018-01-23T20:34:59Z

fixed
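
A minimal sketch of the "wait, then check for failure" pattern requested above, assuming the background job's PID is captured when it is launched. `train_model` is a hypothetical stand-in for a long-running step such as G2P training; it is not taken from the recipe.

```shell
#!/usr/bin/env bash
# Sketch: wait for one specific background job and inspect its exit status.
# train_model is a hypothetical placeholder that simulates a failing job.

train_model() {
  sleep 1
  return 3   # simulate a failure
}

# Launch the job in the background and remember its PID.
train_model &
train_pid=$!

# ... other stages could run here while the job proceeds ...

# "wait PID" returns the exit status of that job, so the failure
# is visible to the parent script at this point.
train_status=0
wait "$train_pid" || train_status=$?

if [ "$train_status" -ne 0 ]; then
  echo "background training failed with status $train_status" >&2
fi
```

In a real script the final branch would typically end with `exit 1`; the sketch only reports the status so that it can be run on its own.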

vimalmanohar · 2018-01-25T16:52:27Z

egs/fisher_swbd/s5/path.sh

@@ -1,4 +1,4 @@
-export KALDI_ROOT=`pwd`/../../..
+export KALDI_ROOT=../../kaldi


Don't change the standard setting, which is the same in all recipes. Also fix this in multi_en and anywhere else you changed it.

vimalmanohar · 2018-01-25T16:52:37Z

egs/fisher_swbd/s5/run.sh


 # prepare fisher data and put it under data/train_fisher
-local/fisher_data_prep.sh /export/corpora3/LDC/LDC2004T19 /export/corpora3/LDC/LDC2005T19 \
+ local/fisher_data_prep.sh /export/corpora3/LDC/LDC2004T19 /export/corpora3/LDC/LDC2005T19 \


Fix indentation.

vimalmanohar · 2018-01-25T16:53:02Z

egs/mini_librispeech/s5/run.sh

  steps/align_si.sh --boost-silence 1.25 --nj 5 --cmd "$train_cmd" \
    data/train_clean_5 data/lang_nosp exp/mono exp/mono_ali_train_clean_5
 fi
-


This is not meant to be changed.

vimalmanohar · 2018-01-25T16:53:25Z

egs/multi_en/s5/path.sh

@@ -1,7 +1,8 @@
-export KALDI_ROOT=`pwd`/../../..
+export KALDI_ROOT=../../kaldi


Change this.

vimalmanohar · 2018-01-25T16:53:41Z

egs/swbd/s5c/path.sh

@@ -1,4 +1,4 @@
-export KALDI_ROOT=`pwd`/../../..
+export KALDI_ROOT=/export/b04/xzhang/kaldi


vimalmanohar · 2018-01-25T16:58:37Z

egs/fisher_swbd/s5/local/chain/run_tdnn_lstm_1a.sh

  --speed-perturb $speed_perturb \
  --generate-alignments $speed_perturb || exit 1;
-
+exit


Remove this.

vimalmanohar · 2018-01-25T16:59:46Z

egs/swbd/s5c/local/chain/tuning/run_tdnn_lstm_1l.sh

 # configs for 'chain'
-stage=12
-train_stage=-10
+stage=15


Undo this change.

vimalmanohar · 2018-01-25T17:00:43Z

egs/wsj/s5/utils/mkgraph.sh

 tscale=1.0
 loopscale=0.1
-
+unkscale=1.0


This does not seem to be used.

Sorry, I unintentionally committed some files I didn't plan to commit.

Finished cleaning up.

vimalmanohar · 2018-01-25T23:58:50Z

egs/multi_en/s5/run.sh

 if [ $stage -le 8 ]; then
  # get rid of spk2gender files because not all corpora have them
-  rm -f data/*/train/spk2gender
+  rm data/*/train/spk2gender 2>/dev/null


There should be -f. Otherwise the script might exit because of set -e.

I don't like using -f to make 'rm' quiet, because it overrides the protection when people change file modes to save something. Instead, do || true.
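
The difference between the three variants discussed here can be checked with a small experiment. This is only a sketch: the path is a hypothetical file that does not exist, and each variant runs in its own child `bash` so that `set -e` applies cleanly.

```shell
#!/usr/bin/env bash
# Sketch: how a failing rm interacts with "set -e".
# "missing" is a hypothetical path inside a fresh temp dir; the file does not exist.

missing=$(mktemp -d)/spk2gender

# Case 1: plain rm with stderr silenced. rm still returns non-zero,
# so under "set -e" the child shell stops before reaching the echo.
out_plain=$(bash -c 'set -e; rm "$1" 2>/dev/null; echo reached' _ "$missing" || true)

# Case 2: "|| true" masks the failure, so "set -e" is not triggered.
out_or_true=$(bash -c 'set -e; rm "$1" 2>/dev/null || true; echo reached' _ "$missing")

# Case 3: rm -f ignores nonexistent files and returns zero.
out_force=$(bash -c 'set -e; rm -f "$1"; echo reached' _ "$missing")

echo "plain='$out_plain' or_true='$out_or_true' force='$out_force'"
```

Both `-f` and `|| true` keep the script alive; the second does so without also suppressing rm's handling of write-protected files.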

vimalmanohar · 2018-01-25T23:59:51Z

egs/multi_en/s5/run.sh

  cp data/local/dict_combined/{extra_questions,nonsilence_phones,silence_phones,optional_silence}.txt $dict_dir
-  local/g2p/apply_g2p.sh --var-counts 1 exp/g2p/model.fst data/local/g2p_phonetisarus data/local/dict_combined/lexicon.txt $dict_dir/lexicon.txt
+  local/g2p/apply_g2p.sh --var-counts 1 exp/g2p/model.fst data/local/g2p_phonetisarus \
+    data/local/dict_combined/lexicon.txt $dict_dir/lexicon.txt || touch $dict_dir/.error


This does not seem to be running in the background. So better to use || exit

vimalmanohar · 2018-01-26T00:01:19Z

egs/multi_en/s5/run.sh

+     data=data/$c/train
+     steps/make_mfcc.sh --mfcc-config conf/mfcc.conf \
+       --cmd "$train_cmd" --nj 40 \
+       $data exp/make_mfcc/$c/train || touch $data/.error


Are you checking for the error in a later stage? Also, you should first rm -f the .error files.

Hmm, I actually changed it back to || exit 1; since I think we should quit if the feature extraction fails.

|| exit 1 will have no effect inside a subshell running in background.
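
A minimal sketch of the marker-file approach for background subshells, under stated assumptions: `work_dir` and `process_corpus` are hypothetical stand-ins for the recipe's data directory and feature-extraction step, and the job for the corpus named "bad" is made to fail on purpose.

```shell
#!/usr/bin/env bash
# Sketch: failure handling for jobs launched in background subshells.
# work_dir and process_corpus are hypothetical, for illustration only.

work_dir=$(mktemp -d)

# Hypothetical job: fails for the corpus named "bad", succeeds otherwise.
process_corpus() {
  [ "$1" != "bad" ]
}

# Clear any stale marker from a previous run before starting.
rm -f "$work_dir/.error"

for c in good bad; do
  (
    # "|| exit 1" here would only end this subshell, not the parent
    # script, so record the failure in a marker file instead.
    process_corpus "$c" || touch "$work_dir/.error"
  ) &
done

# Block until every background job has finished.
wait

# Check the marker once all jobs are done.
if [ -f "$work_dir/.error" ]; then
  status="failed"
else
  status="ok"
fi
echo "background jobs: $status"
```

A real script would usually follow the check with `exit 1` when the marker exists. For a step that runs in the foreground, a plain `|| exit 1` is enough.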

vimalmanohar · 2018-01-26T00:02:13Z

egs/multi_en/s5/run.sh

  local/wsj_format_data.sh
  utils/copy_data_dir.sh --spk_prefix wsj_ --utt_prefix wsj_ data/wsj/train_si284 data/wsj/train
-  rm -rf data/wsj/train_si284
+  rm -r data/wsj/train_si284 2>/dev/null


Without -f the script will exit on error, since you set -e.

danpovey

couple comments..

danpovey · 2018-01-31T23:10:32Z

egs/multi_en/s5/RESULTS

-multi_a  tri5a  tg_sp_eval2000.si  ||  %WER 30.5 | 4459 42989 | 73.7 20.3 6.0 4.2 30.5 67.8 | exp/multi_a/tri5a/decode_tg_sp_eval2000.si/score_10_1.0/eval2000.ctm.filt.sys
-multi_a  tri5b  tg_eval2000        ||  %WER 24.3 | 4459 42989 | 79.3 15.7 5.0 3.6 24.3 63.5 | exp/multi_a/tri5b/decode_tg_eval2000/score_11_0.0/eval2000.ctm.filt.sys
-multi_a  tri5b  tg_eval2000.si     ||  %WER 30.7 | 4459 42989 | 73.6 20.4 6.0 4.3 30.7 68.1 | exp/multi_a/tri5b/decode_tg_eval2000.si/score_10_1.0/eval2000.ctm.filt.sys
+# On eval2000 the final GMM results is 24.5, which is better than the above result (24.9). 


I'm pretty sure this comment doesn't belong here.

Also the results seem slightly worse than before. Is there an explanation for this? Is it the case that for other test sets it was better?
What exactly is better about this new setup than how it was before?

I fixed some bugs in acronym normalization in the swbd training data prep (previously these acronyms were wrongly normalized into OOVs), and I fixed the way we apply G2P to OOVs in the training data, i.e. we no longer apply G2P to words containing special characters. But these changes are too small to affect WERs. The major change that may affect WER is that we now have 72h more training data (hub4 97). Since there is a mismatch between hub4's bn style and eval2000's style, there could be some degradation at the GMM level (actually, looking at tri5a.si and tri5b.si, it's slightly better than before). I think this could be compensated for at the chain model level. BTW I intentionally put that comment there...
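
As an illustration of skipping G2P for words with special characters, here is a minimal sketch. It assumes that a "special character" is anything other than a lowercase ASCII letter or an apostrophe; the word list is made up, and the actual recipe may use a different rule.

```shell
#!/usr/bin/env bash
# Sketch: keep only OOV words that are safe to pass to G2P.
# Assumption: only lowercase ASCII letters and apostrophes are allowed;
# this rule and the word list are hypothetical.

oov_words=$(printf '%s\n' "hello" "don't" "[laughter]" "mp3" "a_b" "world")

# -x matches the whole line, so a word containing any other
# character is dropped from the candidate list.
g2p_candidates=$(printf '%s\n' "$oov_words" | LC_ALL=C grep -E -x "[a-z']+")

printf '%s\n' "$g2p_candidates"
```

Words rejected by the filter would then need pronunciations from some other source, or be left as OOVs.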

xiaohui-zhang · 2018-02-07T05:49:12Z

ready to merge I think

xiaohui-zhang force-pushed the lexlearn branch from 342a3f3 to 4998968 Compare January 20, 2018 06:26

vimalmanohar reviewed Jan 23, 2018

xiaohui-zhang force-pushed the lexlearn branch from 4998968 to 8295e35 Compare January 23, 2018 20:33

vimalmanohar reviewed Jan 25, 2018

xiaohui-zhang force-pushed the lexlearn branch 3 times, most recently from 41d52dd to 7b5d0e8 Compare January 25, 2018 23:35

vimalmanohar reviewed Jan 26, 2018

xiaohui-zhang force-pushed the lexlearn branch from 7b5d0e8 to 0d881e8 Compare January 26, 2018 01:48

danpovey reviewed Jan 31, 2018

xiaohui-zhang added 7 commits February 7, 2018 00:38

multi_en: Fixed acronym normalization, swbd lexicon preparation, OOV …

827ddbf

…pronunciation generation, acoustic data sub-sampling,.etc; Added hub4_97 data

removed the last GMM training stage which didn't bring improvements

508358f

minor fix

f3549f2

minor fix

2283631

minor fix

dc18c1c

fixes kaldi-asr#2206

7eec69d

minor fixes

1d4959a

xiaohui-zhang force-pushed the lexlearn branch from f245faa to 1d4959a Compare February 7, 2018 05:47

danpovey merged commit 8e170e0 into kaldi-asr:master Feb 7, 2018
