Sprak nnet3 #1402

dresen · 2017-02-07T14:40:01Z

No description provided.

danpovey · 2017-02-07T18:40:38Z

Thanks! It may be a day or two before I get to this; I am a bit busy today.

On Tue, Feb 7, 2017 at 9:40 AM, Andreas Søeborg Kirkedal < ***@***.***> wrote:

You can view, comment on, or merge this pull request online at: #1402

Commit Summary

- Merge pull request #4 from kaldi-asr/master
- Merge pull request #5 from kaldi-asr/master
- Merge pull request #6 from kaldi-asr/master
- Merge pull request #7 from kaldi-asr/master
- Added nnet3 recipes copied from tedlium/s5_r2. Modified run_tdnn.sh and run_ivector_common.sh to this setup. Achieves new state-of-the-art on dev set (~11% WER).
- Modified run_lstm.sh to work with the sprakbanken data got 11.47 %WER and added the --tries 100 flag to wget in sprak_data_prep.sh so the download does not break due to broken connections
- Modified xent and chain training scripts and added tham to run.sh
- Removed old unused scripts in local/ and updated the RESULTS file
- Removed comments from the tedlium setup from chain and nnet3 training scripts and removed python3 check+install from sprak_data_prep.sh because the script is compatible with python2.

File Changes

- *M* egs/sprakbanken/s5/RESULTS <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-0> (46)
- *A* egs/sprakbanken/s5/conf/mfcc_hires.conf <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-1> (11)
- *A* egs/sprakbanken/s5/conf/online_cmvn.conf <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-2> (1)
- *A* egs/sprakbanken/s5/local/chain/compare_wer_general.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-3> (64)
- *A* egs/sprakbanken/s5/local/chain/run_tdnn.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-4> (1)
- *A* egs/sprakbanken/s5/local/chain/run_tdnn_lstm.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-5> (1)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_lstm_1a.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-6> (260)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_lstm_1b.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-7> (261)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_lstm_1c.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-8> (259)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_lstm_1d.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-9> (272)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_lstm_1e.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-10> (262)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_tdnn_1a.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-11> (202)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_tdnn_1b.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-12> (227)
- *A* egs/sprakbanken/s5/local/chain/tuning/run_tdnn_lstm_1a.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-13> (255)
- *D* egs/sprakbanken/s5/local/cstr_ndx2flist.pl <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-14> (54)
- *D* egs/sprakbanken/s5/local/find_transcripts.pl <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-15> (64)
- *D* egs/sprakbanken/s5/local/flist2scp.pl <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-16> (31)
- *D* egs/sprakbanken/s5/local/generate_example_kws.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-17> (110)
- *A* egs/sprakbanken/s5/local/generate_results_file.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-18> (16)
- *D* egs/sprakbanken/s5/local/kws_data_prep.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-19> (60)
- *A* egs/sprakbanken/s5/local/nnet3/run_blstm.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-20> (48)
- *A* egs/sprakbanken/s5/local/nnet3/run_ivector_common.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-21> (238)
- *A* egs/sprakbanken/s5/local/nnet3/run_lstm.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-22> (174)
- *A* egs/sprakbanken/s5/local/nnet3/run_tdnn.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-23> (102)
- *D* egs/sprakbanken/s5/local/run_basis_fmllr.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-24> (42)
- *D* egs/sprakbanken/s5/local/run_kl_hmm.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-25> (24)
- *D* egs/sprakbanken/s5/local/run_raw_fmllr.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-26> (67)
- *M* egs/sprakbanken/s5/local/sprak_data_prep.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-27> (17)
- *D* egs/sprakbanken/s5/local/sprak_run_mmi_tri4b.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-28> (56)
- *D* egs/sprakbanken/s5/local/sprak_train_cmulm.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-29> (61)
- *M* egs/sprakbanken/s5/run.sh <https://github.com/kaldi-asr/kaldi/pull/1402/files#diff-30> (35)

Patch Links:

- https://github.com/kaldi-asr/kaldi/pull/1402.patch
- https://github.com/kaldi-asr/kaldi/pull/1402.diff

galv · 2017-02-08T05:00:35Z

I noticed that you have a lot of early commits that merge the kaldi master branch. I recommend you get rid of those. In the future you can rebase your work on top of the kaldi master branch. But for now, you can cherry-pick your commits on top of a fresh master branch.

For example, if you make a branch my-branch which is a copy of master:

git cherry-pick dbca511
git cherry-pick d3d4e41

and so on. git will tell you if there are conflicts but I doubt this will be the case for a new recipe.
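
The cherry-pick workflow described above can be tried out safely in a throwaway repository. The following is only a minimal sketch: the temporary repository, the file names, and the `feature` branch are assumptions made for illustration, and `my-branch` stands in for the fresh copy of master. In the real pull request the hashes to pick would be the author's own commits (for example dbca511 and d3d4e41) rather than the one generated here.

```shell
#!/bin/bash
# Throwaway demo repository; nothing here touches a real Kaldi checkout.
demo_repo=$(mktemp -d)
cd "$demo_repo" || exit 1
git init -q .
git config user.email "demo@example.com"   # local identity, needed to commit
git config user.name "Demo User"

# A base commit standing in for the upstream master branch.
echo "base" > base.txt
git add base.txt
git commit -q -m "Base commit"
base_branch=$(git rev-parse --abbrev-ref HEAD)

# A feature branch with one commit of real work (the commit we want to keep).
git checkout -q -b feature
echo "recipe" > recipe.txt
git add recipe.txt
git commit -q -m "Add recipe"
wanted=$(git rev-parse HEAD)

# Start a fresh branch from the base branch and replay only the wanted commit.
git checkout -q "$base_branch"
git checkout -q -b my-branch
git cherry-pick "$wanted" > /dev/null

# The new branch now holds the base commit plus the replayed work, and no merge commits.
git log --oneline
```

Because each commit is replayed one at a time, any conflict is reported at the commit that causes it, which is what makes this approach manageable for a branch with a handful of real commits.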

danpovey · 2017-02-08T05:05:11Z

I've got in the habit of doing 'squash and merge' rather than retaining the history of the committer's branch. So this may not matter.

galv

I checked with grep that all the removed files are indeed unused files from a long time ago, so that part looks good to me.

I haven't reviewed anything inside the tuning/ directory yet.

galv · 2017-02-08T05:11:28Z

egs/sprakbanken/s5/RESULTS

-%WER 20.82 [ 2251 / 10811, 399 ins, 471 del, 1381 sub ] exp/tri4b/decode_3g_test1k.si/wer_13
-%WER 17.53 [ 1895 / 10811, 403 ins, 375 del, 1117 sub ] exp/tri4b/decode_4g_test1k/wer_13
-%WER 20.99 [ 2269 / 10811, 438 ins, 436 del, 1395 sub ] exp/tri4b/decode_4g_test1k.si/wer_11
+%WER 22.87 [ 24286 / 106172, 3577 ins, 5321 del, 15388 sub ] exp/tri1/decode_fg_dev/wer_12_0.5


Looks like you removed the results of existing recipes. Why?

Those results are based on a subset of the whole test set that was used for "fast testing". In practice that subset ended up serving as a development set, which is not acceptable for data drawn from the test set. I therefore redid all GMM experiments with tuning on the actual dev set rather than test1k, and did the same for the nnet3 and chain recipes.

galv · 2017-02-08T05:19:36Z

egs/sprakbanken/s5/conf/online_cmvn.conf

@@ -0,0 +1 @@
+# configuration file for apply-cmvn-online, used in the script ../local/run_online_decoding.sh


local/run_online_decoding.sh does not exist for this recipe. Maybe you accidentally copied this file from tedlium.

If it is unused, delete it.

This is the next thing I want to work on.

galv · 2017-02-08T05:26:02Z

egs/sprakbanken/s5/local/chain/compare_wer_general.sh

@@ -0,0 +1,64 @@
+#!/bin/bash
+
+echo $0 $*


Be sure to document what arguments are appropriate for $*. Looks like it's simply the names of the exp/ directories you want to compare. You should print this help message out if $* is empty.
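
One way to act on this request is sketched below. It is not the code that was merged: the function names `usage` and `check_args` and the example directory names are hypothetical, and a standalone script would normally call `exit 1` directly instead of returning from a function.

```shell
#!/bin/bash
# Hypothetical argument check for a script that compares experiment directories.

usage() {
  echo "Usage: $0 <exp-dir1> [<exp-dir2> ...]"
  echo " e.g.: $0 exp/chain/tdnn_1a exp/chain/tdnn_1b"   # example paths are illustrative
}

check_args() {
  # With no arguments there is nothing to compare, so print the help text to stderr.
  if [ $# -eq 0 ]; then
    usage >&2
    return 1
  fi
  return 0
}

# In the script itself this would be followed by the comparison loops:
# check_args "$@" || exit 1
```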

galv · 2017-02-08T05:30:30Z

egs/sprakbanken/s5/local/chain/compare_wer_general.sh

+  prob=$(grep Overall $x/log/compute_prob_valid.final.log | grep -w xent | awk '{printf("%.4f", $8)}')
+  printf "% 10s" $prob
+done
+echo


There is nothing about this file specific to the sprakbanken dataset. It should therefore belong in utils/ instead. Probably the best place to put it is utils/nnet3, but that doesn't exist, so maybe steps/nnet3 is a better place (especially since there are other "utility"-like scripts there, like the reporting scripts and nnet3_to_dot.sh).

Another maintainer should comment on whether this script is redundant with some other script and whether it should be included.

I believe this script may not have been properly updated to reflect the characteristics of the Sprakbanken database. It is supposed to print out a little table for comparing WER and objective values between different training runs. But this seems to be a copy of some other script. It does contain some corpus-specific things-- the set of names of decoded directories, and what type of scoring they have (Kaldi vs. sclite).

It's a fully new script according to git, so it's either newly created or copy-pasted from elsewhere, actually. It clearly works for @dresen if it's here now. Regardless, if it's not robust and clearly-documented, my guess is that we shouldn't keep it.

I had forgotten to remove this script. I copied it over from tedlium because I wanted to get it to work, but as I understand score_sclite.sh, I need a 'glm' or 'slm' file, which is not part of the Språkbanken data.

You could easily get it to work; compare with the Switchboard setup, where the dev set uses Kaldi scoring rather than sclite scoring. But either get it to work or remove it.

galv · 2017-02-08T05:31:34Z

egs/sprakbanken/s5/local/chain/compare_wer_general.sh

+echo $0 $*
+
+echo -n "System               "
+for x in $*; do   printf "% 10s" " $(basename $x)";   done


Change $* to "$@"

It's important in case someone has an experimental directory that contains a space. That's probably never going to happen, but you never know.
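
The difference is easy to see by counting loop iterations. This is a small illustrative sketch; the helper names `count_unquoted` and `count_quoted` and the directory name containing a space are made up for the demonstration.

```shell
#!/bin/bash
# Count how many times a for-loop iterates with each style of argument expansion.

count_unquoted() {
  local n=0 x
  for x in $*; do        # unquoted: arguments are re-split on whitespace
    n=$((n + 1))
  done
  echo "$n"
}

count_quoted() {
  local n=0 x
  for x in "$@"; do      # quoted: each original argument stays intact
    n=$((n + 1))
  done
  echo "$n"
}

count_unquoted "exp/my run" exp/tdnn   # prints 3
count_quoted "exp/my run" exp/tdnn     # prints 2
```

With the unquoted form, the directory `exp/my run` is treated as two separate names, so every later `grep` on `$x/log/...` would look in the wrong place.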

galv · 2017-02-08T05:33:16Z

egs/sprakbanken/s5/local/chain/run_tdnn.sh

@@ -0,0 +1 @@
+tuning/run_tdnn_1b.sh


It doesn't really matter, but best to use #!/bin/bash at the top here. Same for run_tdnn_lstm.sh

galv · 2017-02-08T05:35:30Z

egs/sprakbanken/s5/local/chain/run_tdnn.sh

@@ -0,0 +1 @@
+tuning/run_tdnn_1b.sh


Also, you have run_tdnn_lstm.sh and run_tdnn.sh, but you're missing local/chain/run_lstm.sh. Any reason why you don't have the last file?

It wasn't in the tedlium folder I copied, but I'll add it.

Actually no need to have the 'just-lstm' number, generally TDNN+LSTM will work better.

galv · 2017-02-08T05:38:03Z

egs/sprakbanken/s5/local/chain/tuning/run_tdnn_1a.sh

+# note, if you have already run the corresponding non-chain nnet3 system
+# (local/nnet3/run_tdnn.sh), you may want to run with --stage 14.
+
+set -e -o pipefail


@danpovey @vijayaditya Do we consider these settings okay? I feel like I remember being asked to remove these in one of my earlier pull requests, but I'm not sure. At the very least, I think I was asked to remove -e because you all prefer being explicit with || exit 1 clauses.

The current situation is many scripts do have the 'set -e', and many do not. Officially, either is OK right now. It's not an ideal situation but that's the way it is.
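
For readers unfamiliar with the two styles, the sketch below contrasts them. It is an illustration only: the helper `exit_status_of` is hypothetical, and each snippet is run in its own `bash -c` process so that its exit status can be observed without terminating the calling shell.

```shell
#!/bin/bash
# Run a snippet in a separate bash process and report its exit status.
exit_status_of() {
  local status=0
  bash -c "$1" > /dev/null 2>&1 || status=$?
  echo "$status"
}

# Style 1: explicit checks after each command that may fail.
explicit_style='false || exit 1; echo "not reached"'

# Style 2: set -e alone does not notice a failure on the left side of a pipe.
set_e_only='set -e; false | true; echo "reached"'

# Style 3: set -e together with pipefail stops at the failing pipeline.
set_e_pipefail='set -e -o pipefail; false | true; echo "not reached"'

echo "explicit || exit 1 : $(exit_status_of "$explicit_style")"   # 1
echo "set -e only        : $(exit_status_of "$set_e_only")"       # 0
echo "set -e -o pipefail : $(exit_status_of "$set_e_pipefail")"   # 1
```

The middle case shows why the script under review pairs `-e` with `-o pipefail`: without it, a failing first stage of a pipeline goes unnoticed.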

galv · 2017-02-08T05:47:58Z

egs/sprakbanken/s5/run.sh

 #local/sprak_run_sgmm2.sh dev

+# Run neural network setups based in the TEDLIUM recipe
+if [ $stage -le 11 ]; then


@danpovey Is it typical to include nnet3 recipes inside run.sh? I feel that leaving in the lstm recipes is not good since those have the potential to run for a very long time without someone realizing.

Hm. I'm not sure. I'd probably prefer to include the command (or commands) but commented out, with a comment saying that this is the command to run [such-and-such.]

I'll comment it out
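
The suggestion can be pictured roughly as follows. This is a hypothetical layout, not the contents of the merged run.sh: the stage numbers, the `run` dry-run helper, and the `run_nnet_stages` function are assumptions for illustration, while the script paths are taken from the file list of this pull request.

```shell
#!/bin/bash
# Sketch only: stage numbers and the dry-run helper are illustrative assumptions.

# Dry-run stand-in; a real run.sh would invoke the script directly.
run() { echo "would run: $*"; }

run_nnet_stages() {
  local stage=$1
  if [ "$stage" -le 11 ]; then
    run local/nnet3/run_tdnn.sh
  fi
  if [ "$stage" -le 12 ]; then
    run local/chain/run_tdnn.sh
  fi
  # The LSTM recipes can train for a very long time, so they are left
  # commented out. These are the commands to run the LSTM systems:
  # run local/nnet3/run_lstm.sh
  # run local/chain/run_tdnn_lstm.sh
}

run_nnet_stages 0
```

Keeping the commands visible but commented out documents how to launch the long-running recipes without starting them by accident.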

danpovey · 2017-02-08T06:41:29Z

OK, let's see what he says about whether it's been tested. It was probably copied from a similar script in egs/swbd/s5c or egs/ami/s5b or egs/tedlium/s5_r2, but not sure if it has been used locally. Those scripts need to be documented better.

danpovey · 2017-02-09T20:57:20Z

Looks good. I'll merge and if any issues are found in future we'll address them then.

dresen added 9 commits November 10, 2016 09:07

- 1472b0b Merge pull request #4 from kaldi-asr/master
  Update from original
- e840588 Merge pull request #5 from kaldi-asr/master
  Update from original
- b0eba63 Merge pull request #6 from kaldi-asr/master
  Merging to resolve conflict
- bec69c2 Merge pull request #7 from kaldi-asr/master
  Swedish changes (kaldi-asr#1242)
- dbca511 Added nnet3 recipes copied from tedlium/s5_r2. Modified run_tdnn.sh and run_ivector_common.sh to this setup. Achieves new state-of-the-art on dev set (~11% WER).
- d3d4e41 Modified run_lstm.sh to work with the sprakbanken data got 11.47 %WER and added the --tries 100 flag to wget in sprak_data_prep.sh so the download does not break due to broken connections
- fbcaf6e Modified xent and chain training scripts and added tham to run.sh
- d91019e Removed old unused scripts in local/ and updated the RESULTS file
- 3c21d80 Removed comments from the tedlium setup from chain and nnet3 training scripts and removed python3 check+install from sprak_data_prep.sh because the script is compatible with python2.

galv reviewed Feb 8, 2017

Adressed the comments from @danpovey and @galv

fcdc807

danpovey merged commit bcc71b6 into kaldi-asr:master Feb 9, 2017

3 participants