
Conversation

@xiaohui-zhang (Contributor):

…pronunciation generation, acoustic data sub-sampling, etc.; Added hub4_97 data. @vimalmanohar can you please review it? Thanks!

--cmd "$train_cmd" --nj 40 \
$data exp/make_mfcc/$c/train || exit 1;
steps/compute_cmvn_stats.sh \
$data exp/make_mfcc/$c/train || exit 1;
Contributor:

Exiting here is not very useful. Better to do || touch $dir/.error and check for that later.
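The marker-file pattern the reviewer suggests could be sketched roughly like this (paths and the stand-in commands here are placeholders, not from the actual recipe):

```shell
#!/usr/bin/env bash
# Sketch of the ".error marker" pattern: each step records a failure with a
# marker file instead of aborting mid-script, and a single check at the end
# catches any recorded failure. $dir is a placeholder working directory.
dir=$(mktemp -d)

true || touch $dir/.error    # stand-in for steps/make_mfcc.sh ...
true || touch $dir/.error    # stand-in for steps/compute_cmvn_stats.sh ...

# Later, one check covers all of the preceding steps:
if [ -f $dir/.error ]; then
  echo "an earlier stage failed" >&2
  exit 1
fi
echo "all stages succeeded"
```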

# We prepare the dictionary in data/local/dict_combined.
local/prepare_dict.sh $swbd $tedlium2
local/g2p/train_g2p.sh --stage 0 --silence-phones "data/local/dict_combined/silence_phones.txt" data/local/dict_combined exp/g2p
wait
Contributor:

Comment that you are waiting for train_g2p.sh to finish.

Contributor:

Also you should check for errors if it has failed.
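One way to wait for the backgrounded G2P job and check whether it failed might look like this (a sketch; `sleep 0` stands in for the real `local/g2p/train_g2p.sh` call):

```shell
#!/usr/bin/env bash
# Sketch: capture the background job's PID, wait for it specifically,
# and check the status that wait returns for that job.
sleep 0 &                    # stand-in for local/g2p/train_g2p.sh ... &
g2p_pid=$!

# Waiting for train_g2p.sh to finish.
wait $g2p_pid
g2p_status=$?
if [ $g2p_status -ne 0 ]; then
  echo "train_g2p.sh failed" >&2
  exit 1
fi
echo "g2p training finished"
```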

@xiaohui-zhang (Contributor Author):

fixed

@@ -1,4 +1,4 @@
export KALDI_ROOT=`pwd`/../../..
export KALDI_ROOT=../../kaldi
Contributor:

Don't change the standard, which is the same in all recipes. Also fix this in multi-en and elsewhere you changed.


# prepare fisher data and put it under data/train_fisher
local/fisher_data_prep.sh /export/corpora3/LDC/LDC2004T19 /export/corpora3/LDC/LDC2005T19 \
local/fisher_data_prep.sh /export/corpora3/LDC/LDC2004T19 /export/corpora3/LDC/LDC2005T19 \
Contributor:

Fix indentation.

steps/align_si.sh --boost-silence 1.25 --nj 5 --cmd "$train_cmd" \
data/train_clean_5 data/lang_nosp exp/mono exp/mono_ali_train_clean_5
fi

Contributor:

This is not meant to be changed.

@@ -1,7 +1,8 @@
export KALDI_ROOT=`pwd`/../../..
export KALDI_ROOT=../../kaldi
Contributor:

Change this.

@@ -1,4 +1,4 @@
export KALDI_ROOT=`pwd`/../../..
export KALDI_ROOT=/export/b04/xzhang/kaldi
Contributor:

Fix this

--speed-perturb $speed_perturb \
--generate-alignments $speed_perturb || exit 1;

exit
Contributor:

Remove this.

# configs for 'chain'
stage=12
train_stage=-10
stage=15
Contributor:

Undo this change.

tscale=1.0
loopscale=0.1

unkscale=1.0
Contributor:

This does not seem to be used.

Contributor Author:

Sorry, I unintentionally committed some files I didn't plan to commit.

Contributor Author:

finished cleaning up

@xiaohui-zhang xiaohui-zhang force-pushed the lexlearn branch 3 times, most recently from 41d52dd to 7b5d0e8 Compare January 25, 2018 23:35
if [ $stage -le 8 ]; then
# get rid of spk2gender files because not all corpora have them
rm -f data/*/train/spk2gender
rm data/*/train/spk2gender 2>/dev/null
Contributor:

There should be -f. Otherwise the script might exit because of set -e.

Contributor:

I don't like using -f to make 'rm' quiet, because it overrides rm's prompt when people change file modes to protect something. Instead, do || true
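The `|| true` alternative could be sketched like this (a minimal example; the path is a placeholder):

```shell
#!/usr/bin/env bash
set -e
# Sketch: under "set -e", a failing rm would abort the script. "|| true"
# ignores the missing-file error without suppressing rm's interactive
# prompt for write-protected files the way -f would.
dir=$(mktemp -d)
rm $dir/spk2gender 2>/dev/null || true   # file may not exist; keep going
echo "script continued past rm"
```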

cp data/local/dict_combined/{extra_questions,nonsilence_phones,silence_phones,optional_silence}.txt $dict_dir
local/g2p/apply_g2p.sh --var-counts 1 exp/g2p/model.fst data/local/g2p_phonetisarus data/local/dict_combined/lexicon.txt $dict_dir/lexicon.txt
local/g2p/apply_g2p.sh --var-counts 1 exp/g2p/model.fst data/local/g2p_phonetisarus \
data/local/dict_combined/lexicon.txt $dict_dir/lexicon.txt || touch $dict_dir/.error
@vimalmanohar (Contributor), Jan 25, 2018:

This does not seem to be running in the background. So better to use || exit

data=data/$c/train
steps/make_mfcc.sh --mfcc-config conf/mfcc.conf \
--cmd "$train_cmd" --nj 40 \
$data exp/make_mfcc/$c/train || touch $data/.error
Contributor:

Are you checking if there is an error in a later stage? Also, you should first rm -f the .error files.
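Clearing stale markers before launching the jobs might look like this (a sketch; `$data` is a temp stand-in for `data/$c/train`, and `true` stands in for the real feature-extraction call):

```shell
#!/usr/bin/env bash
# Sketch: remove stale .error markers before starting, so a marker left
# over from a previous run cannot masquerade as a new failure.
data=$(mktemp -d)            # stand-in for data/$c/train
touch $data/.error           # pretend a previous run left this behind

rm -f $data/.error                 # clear stale markers first
( true || touch $data/.error ) &   # stand-in for steps/make_mfcc.sh ...
wait
if [ -f $data/.error ]; then
  echo "feature extraction failed" >&2
  exit 1
fi
echo "feature extraction succeeded"
```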

Contributor Author:

Hmm, I actually changed it back to || exit 1; since I think we should quit if the feature extraction fails.

Contributor:

|| exit 1 will have no effect inside a subshell running in the background.
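The reviewer's point can be demonstrated with a small sketch: the `exit 1` terminates only the backgrounded subshell, so the parent never notices unless a marker file is used instead.

```shell
#!/usr/bin/env bash
# Sketch: "exit 1" inside a backgrounded subshell kills only that subshell;
# the parent script keeps running. A marker file lets the parent detect
# the failure after wait.
dir=$(mktemp -d)

( false || exit 1 ) &              # this exit affects only the subshell
wait
echo "parent survived the subshell's exit 1"

( false || touch $dir/.error ) &   # marker-file alternative
wait
if [ -f $dir/.error ]; then
  echo "failure detected via .error marker"
fi
```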

Contributor Author:

fixed

local/wsj_format_data.sh
utils/copy_data_dir.sh --spk_prefix wsj_ --utt_prefix wsj_ data/wsj/train_si284 data/wsj/train
rm -rf data/wsj/train_si284
rm -r data/wsj/train_si284 2>/dev/null
Contributor:

Without -f, the script will exit on error, since you have set -e.

@danpovey (Contributor) left a comment:

A couple of comments:

multi_a tri5a tg_sp_eval2000.si || %WER 30.5 | 4459 42989 | 73.7 20.3 6.0 4.2 30.5 67.8 | exp/multi_a/tri5a/decode_tg_sp_eval2000.si/score_10_1.0/eval2000.ctm.filt.sys
multi_a tri5b tg_eval2000 || %WER 24.3 | 4459 42989 | 79.3 15.7 5.0 3.6 24.3 63.5 | exp/multi_a/tri5b/decode_tg_eval2000/score_11_0.0/eval2000.ctm.filt.sys
multi_a tri5b tg_eval2000.si || %WER 30.7 | 4459 42989 | 73.6 20.4 6.0 4.3 30.7 68.1 | exp/multi_a/tri5b/decode_tg_eval2000.si/score_10_1.0/eval2000.ctm.filt.sys
# On eval2000 the final GMM results is 24.5, which is better than the above result (24.9).
Contributor:

I'm pretty sure this comment doesn't belong here.

Contributor:

Also, the results seem slightly worse than before. Is there an explanation for this? Is it the case that for other test sets it was better?
What exactly is better about this new setup than how it was before?

@xiaohui-zhang (Contributor Author), Feb 1, 2018:

I fixed some bugs in acronym normalization in the swbd training data prep (previously these acronyms were wrongly normalized as OOVs), and I fixed the way we apply G2P to OOVs in the training data, i.e. we no longer apply G2P to words containing special characters. But these changes are too small to affect WERs. The major change that may affect WER is that we now have 72h more training data (hub4_97). Since there is a mismatch between hub4's broadcast-news style and eval2000's style, there could be some degradation at the GMM level (actually, looking at tri5a.si and tri5b.si, it's slightly better than before). I think this could be compensated for at the chain-model level. BTW, I intentionally put that comment there...

@xiaohui-zhang (Contributor Author):

ready to merge I think

@danpovey danpovey merged commit 8e170e0 into kaldi-asr:master Feb 7, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018
…, OOV … (kaldi-asr#2137)

* multi_en: Fixed acronym normalization, swbd lexicon preparation, OOV pronunciation generation, acoustic data sub-sampling, etc.; Added hub4_97 data