[WIP] Semi-supervised training using chain models #1657

hhadian · 2017-05-29T22:30:00Z

I have tried to write a rather generic script local/chain/run_semisupervised.sh which can be easily used for other setups and configs. Please let me know if anything needs to be fixed/improved. I'm working on it and will push more changes whenever ready.
Current initial results with left/right-tolerance=2 and lattice-lm-scale=0.1 (nothing is tuned):

# System                baseline(200h) sup(50h) combined1 combined2
# WER on dev(orig)            9.3      10.6      10.3      10.3
# WER on dev(rescored)        8.7       9.8       9.6       9.6
# WER on test(orig)           9.4      11.0      10.6      10.5
# WER on test(rescored)       9.0      10.3      10.0       9.8
# Final train prob        -0.0893   -0.0861   -0.0882   -0.0890
# Final valid prob        -0.1068   -0.1216   -0.1124   -0.1142
# Final train prob (xent)   -1.4083   -1.5612   -1.4318   -1.4369
# Final valid prob (xent)   -1.4601   -1.6863   -1.5244   -1.5223

combined2 means the "decoding of unsupervised data and combining and retraining" has been performed for a second time (using combined1 model). Recovery of degaradation is from ~25% to ~38%.
Note: Supervised training is done almost only using the supervised data (i.e. 50h, even the ivector extractor and ivectors) except the initial fmllr alignment of the supervised data which I did using the tri3 gmm model trained on the whole dataset (I imagined it wouldn't make a difference).
Also, I did not use Pegah's egs-combination scripts because I realized a simple nnet3-chain-copy-egs command could do it.

@vimalmanohar @pegahgh

danpovey · 2017-05-29T22:35:02Z

Cool. I'd rather have @vimalmanohar review this for now, than getting into it myself. Something I think will be critical, is the splitting of the egs. When Vimal split lattices for discriminative training, he put special attention on the initial/final probs where they were split. I think what is going on here is much simpler, as the acoustic scores are discarded long before the lattices get split. It might lead to a situation where a too-deep lattice would actually give you a degradation because you have too many low-likelihood paths active near the edges of the chunks. Anyway, the degradation from this should decrease as the egs from the unsupervised part get longer, so if you try doubling the length of the egs from the unsupervised part, it will give us a sense of how important this issue is.

…

On Mon, May 29, 2017 at 6:30 PM, Hossein Hadian ***@***.***> wrote: I have tried to write a rather generic script local/chain/run_ semisupervised.sh which can be easily used for other setups and configs. Please let me know if anything needs to be fixed/improved. I'm working on it and will push more changes whenever ready. Current initial results with left/right-tolerance=2 and lattice-lm-scale=0.1 (nothing is tuned): # System baseline(200h) sup(50h) combined1 combined2 # WER on dev(orig) 9.3 10.6 10.3 10.3 # WER on dev(rescored) 8.7 9.8 9.6 9.6 # WER on test(orig) 9.4 11.0 10.6 10.5 # WER on test(rescored) 9.0 10.3 10.0 9.8 # Final train prob -0.0893 -0.0861 -0.0882 -0.0890 # Final valid prob -0.1068 -0.1216 -0.1124 -0.1142 # Final train prob (xent) -1.4083 -1.5612 -1.4318 -1.4369 # Final valid prob (xent) -1.4601 -1.6863 -1.5244 -1.5223 combined2 means the "decoding of unsupervised data and combining and retraining" has been performed for a second time (using combined1 model). Recovery of degaradation is from ~25% to ~38%. Note: Supervised training is done almost only using the supervised data (i.e. 50h, even the ivector extractor and ivectors) except the initial fmllr alignment of the supervised data which I did using the tri3 gmm model trained on the whole dataset (I imagined it wouldn't make a difference). Also, I did not use Pegah's egs-combination scripts because I realized a simple nnet3-chain-copy-egs command could do it. @vimalmanohar <https://github.com/vimalmanohar> @pegahgh <https://github.com/pegahgh> ------------------------------ You can view, comment on, or merge this pull request online at: #1657 Commit Summary - [WIP] Add chain semi-supervised script + src changes - Minor fixes File Changes - *A* egs/tedlium/s5_r2/local/chain/run_semisupervised.sh <https://github.com/kaldi-asr/kaldi/pull/1657/files#diff-0> (174) - *M* egs/tedlium/s5_r2/local/chain/tuning/run_tdnn_1d.sh <https://github.com/kaldi-asr/kaldi/pull/1657/files#diff-1> (6) - *M* egs/wsj/s5/steps/nnet3/chain/get_egs.sh <https://github.com/kaldi-asr/kaldi/pull/1657/files#diff-2> (4) - *M* src/chain/chain-supervision.cc <https://github.com/kaldi-asr/kaldi/pull/1657/files#diff-3> (54) - *M* src/chain/chain-supervision.h <https://github.com/kaldi-asr/kaldi/pull/1657/files#diff-4> (7) Patch Links: - https://github.com/kaldi-asr/kaldi/pull/1657.patch - https://github.com/kaldi-asr/kaldi/pull/1657.diff — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1657>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu95nLsE6jrX905ynsajbxI4od51gks5r-0btgaJpZM4NprIG> .

hhadian · 2017-05-29T22:40:43Z

A potential problem with increasing the length of egs for unsupervised data is that they will be placed in different minibatches so that we won't have minibatches with mixed supervised/unsupervised egs. This might affect randomization in SGD in a bad way?

danpovey · 2017-05-29T22:41:35Z

I don't think it's a problem.

…

On Mon, May 29, 2017 at 6:40 PM, Hossein Hadian ***@***.***> wrote: A potential problem with increasing the length of egs for unsupervised data is that they will be placed in different minibatches so that we won't have minibatches with mixed supervised/unsupervised egs. This might affect randomization in SGD in a bad way? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVux0EBlT5Mn-1Tk7vJUD76dImn_TCks5r-0lvgaJpZM4NprIG> .

hhadian · 2017-05-30T22:42:56Z

New results (testing frames_per_eg=300 and more epochs):

# System                baseline   supervised   comb1a   comb1a_fpe300
# WER on dev(orig)            9.3      10.6      10.2      10.1
# WER on dev(rescored)        8.7       9.8       9.5       9.2
# WER on test(orig)           9.4      11.0      10.4      10.4
# WER on test(rescored)       9.0      10.3       9.8       9.9
# Final train prob        -0.0893   -0.0861   -0.0832   -0.0841
# Final valid prob        -0.1068   -0.1216   -0.1123   -0.1137
# Final train prob (xent)   -1.4083   -1.5612   -1.3885   -1.4014
# Final valid prob (xent)   -1.4601   -1.6863   -1.4917   -1.5180

comb1a is the same as combined1 in my previous comment except with 5 epochs instead if 4.
In these results recovery ranges from 35% to 54%.

hhadian · 2017-05-30T22:56:09Z

One important thing that I forgot to mention yesterday: Currently the way I split to supervised/unsupervised is based on utterances so there is speaker overlap. If there shouldn't be any speaker overlap, I'll need to change it.

danpovey · 2017-05-30T23:05:09Z

I don't think that matters.

…

On Tue, May 30, 2017 at 6:56 PM, Hossein Hadian ***@***.***> wrote: One important thing that I forgot to mention yesterday: Currently the way I split to supervised/unsupervised is based on utterances so there is speaker overlap. If there shouldn't be any speaker overlap, I'll need to change it. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu6SUKNil9TAvSduSfmeB2VaNBqrYks5r_J6LgaJpZM4NprIG> .

vimalmanohar · 2017-05-30T23:07:30Z

I think what people do normally is not have speaker overlaps (or recording overlap in case of tedlium).

On Tue, May 30, 2017, 19:05 Daniel Povey ***@***.***> wrote: I don't think that matters. On Tue, May 30, 2017 at 6:56 PM, Hossein Hadian ***@***.***> wrote: > One important thing that I forgot to mention yesterday: Currently the way > I split to supervised/unsupervised is based on utterances so there is > speaker overlap. If there shouldn't be any speaker overlap, I'll need to > change it. > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#1657 (comment)>, or mute > the thread > < https://github.com/notifications/unsubscribe-auth/ADJVu6SUKNil9TAvSduSfmeB2VaNBqrYks5r_J6LgaJpZM4NprIG > > . > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEATVyM8pe16QH5hQr0T8_0DcMR9joGZks5r_KCsgaJpZM4NprIG> .

-- Vimal Manohar PhD Student Electrical & Computer Engineering Johns Hopkins University

danpovey · 2017-05-30T23:17:09Z

Maybe best to do what Vimal says if it's the common practice, or it could cause problems with some reviewers. But not time-sensitive. On Tue, May 30, 2017 at 7:07 PM, Vimal Manohar <[email protected]> wrote:

…

I think what people do normally is not have speaker overlaps (or recording overlap in case of tedlium). On Tue, May 30, 2017, 19:05 Daniel Povey ***@***.***> wrote: > I don't think that matters. > > On Tue, May 30, 2017 at 6:56 PM, Hossein Hadian < ***@***.***> > wrote: > > > One important thing that I forgot to mention yesterday: Currently the way > > I split to supervised/unsupervised is based on utterances so there is > > speaker overlap. If there shouldn't be any speaker overlap, I'll need to > > change it. > > > > — > > You are receiving this because you commented. > > Reply to this email directly, view it on GitHub > > <#1657 (comment)>, > or mute > > the thread > > < > https://github.com/notifications/unsubscribe-auth/ ADJVu6SUKNil9TAvSduSfmeB2VaNBqrYks5r_J6LgaJpZM4NprIG > > > > . > > > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#1657 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe- auth/AEATVyM8pe16QH5hQr0T8_0DcMR9joGZks5r_KCsgaJpZM4NprIG> > . > -- Vimal Manohar PhD Student Electrical & Computer Engineering Johns Hopkins University — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVuxnT6HZvmseZehFsDRjI7-xJor_Iks5r_KE3gaJpZM4NprIG> .

hhadian · 2017-05-31T16:43:27Z

OK will do. Results of testing different lm-scales:

# System                    lmsc0    lmsc0.1    lmsc0.3    lmsc1.0
# WER on dev(orig)           10.2      10.2      10.2      11.0
# WER on dev(rescored)        9.4       9.5       9.6      10.3
# WER on test(orig)          10.5      10.4      10.6      11.4
# WER on test(rescored)       9.9       9.8      10.0      10.5
# Final train prob        -0.0840   -0.0832   -0.0843   -0.0857
# Final valid prob        -0.1128   -0.1123   -0.1134   -0.1111
# Final train prob (xent)   -1.4019   -1.3885   -1.3976   -1.4325
# Final valid prob (xent)   -1.4901   -1.4917   -1.5097   -1.5283

jtrmal · 2017-05-31T16:51:59Z

I'm not sure if I understand what the numbers mean. Is it the scale applied to the lattice when generating the training examples? y

…

On Wed, May 31, 2017 at 12:43 PM, Hossein Hadian ***@***.***> wrote: OK will do. Results of testing different lm-scales: # System lmsc0 lmsc0.1 lmsc0.3 lmsc1.0 # WER on dev(orig) 10.2 10.2 10.2 11.0 # WER on dev(rescored) 9.4 9.5 9.6 10.3 # WER on test(orig) 10.5 10.4 10.6 11.4 # WER on test(rescored) 9.9 9.8 10.0 10.5 # Final train prob -0.0840 -0.0832 -0.0843 -0.0857 # Final valid prob -0.1128 -0.1123 -0.1134 -0.1111 # Final train prob (xent) -1.4019 -1.3885 -1.3976 -1.4325 # Final valid prob (xent) -1.4901 -1.4917 -1.5097 -1.5283 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AKisXzN7OZVFUr_wNtrQk8KMJkkV8Q-6ks5r_Zi9gaJpZM4NprIG> .

hhadian · 2017-05-31T16:55:35Z

Yes, the scale is applied to the graph/lm weight of the unsupervised decoded lattices while generating chain supervision from them (the acoustic weights are ignored).

danpovey · 2017-05-31T18:11:53Z

That's interesting, but unexpected- I would have expected that a scale of 1.0 would be optimal.

…

On Wed, May 31, 2017 at 12:55 PM, Hossein Hadian ***@***.***> wrote: Yes, the scale is applied to the graph/lm weight of the unsupervised decoded lattices while generating chain supervision from them (the acoustic weights are ignored). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu6AQoCI8bOdr7lKk7wPjlzvGz4LCks5r_ZuKgaJpZM4NprIG> .

danpovey · 2017-05-31T23:23:18Z

src/chain/chain-supervision.cc

+        fst::StdArc(phone, phone,
+                    fst::TropicalWeight(lat_arc.weight.Weight().Value1()
+                                        * opts.lm_scale),
+                    lat_arc.nextstate));


Don't forget to include the weights when you process the final-probs.

danpovey · 2017-05-31T23:27:18Z

src/chain/chain-supervision.cc

+  }
+
+  CompactLattice::Arc::StateId start = clat->Start();  // Should be 0
+  BaseFloat total_backward_cost = beta[start];


this variable is not a cost, it is a negated cost (or a loglike), so the name isn't suitable. Fix it where you got the code from, as well.

danpovey · 2017-05-31T23:36:18Z

src/chain/chain-supervision.cc

+  for (fst::StateIterator<CompactLattice> sit(*clat); !sit.Done(); sit.Next()) {
+    CompactLatticeWeight f = clat->Final(sit.Value());
+    LatticeWeight w = f.Weight();
+    w.SetValue1(w.Value1() + total_backward_cost);


Actually I don't believe this code is quite right for this purpose.. The total_backward_cost can be interpreted as the negative of value1() + value2() of the total weight, and you are just adding it to the value1(). It's the value1() which we care about being normalized. But it won't end up normalized if you process it like this, it will have something with the magnitude of the value2() added to it. It would be OK if you were to zero out all the value2()'s first. But I think it would be easier if you were to do the normalization in a different way-- after converting it to an FST. You could compute the best-path cost using ShortestDistance, and then subtract that from all the final-probs.
This may require a slight refactoring of the code.

So do you mean I should convert it to FST, do the normalization, and the convert it back to lattice (all before calling PhoneLatticeToProtoSupervisionInternal) or alternatively do it on the fst of the resulting ProtoSupervision (i.e. after calling PhoneLatticeToProtoSupervisionInternal)?

danpovey · 2017-06-01T01:26:31Z

Do it on the fst of the resulting ProtoSupervision. Of course, no need to do it if the lm_scale is zero.

…

On Wed, May 31, 2017 at 9:23 PM, Hossein Hadian ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In src/chain/chain-supervision.cc <#1657 (comment)>: > + } + } + + std::vector<double> beta; + if (!ComputeCompactLatticeBetas(*clat, &beta)) { + KALDI_WARN << "Failed to compute backward probabilities on lattice."; + return false; + } + + CompactLattice::Arc::StateId start = clat->Start(); // Should be 0 + BaseFloat total_backward_cost = beta[start]; + + for (fst::StateIterator<CompactLattice> sit(*clat); !sit.Done(); sit.Next()) { + CompactLatticeWeight f = clat->Final(sit.Value()); + LatticeWeight w = f.Weight(); + w.SetValue1(w.Value1() + total_backward_cost); So do you mean I should convert it to FST, do the normalization, and the convert it back to lattice (all before calling PhoneLatticeToProtoSupervisionInternal) or alternatively do it on the fst of the resulting ProtoSupervision (i.e. after calling PhoneLatticeToProtoSupervisionInternal)? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVuwyGHL5x1OvewQ7up1gw5V3DKP1bks5r_hKYgaJpZM4NprIG> .

hhadian · 2017-06-03T02:16:37Z

Results of trying different lattice beams (still on tedlium):

# System              baseline(200h) sup(50hr)  beam8.0    beam4.0  beam2.0
# WER on dev(orig)            9.3      10.6      10.2      9.8      9.7
# WER on dev(rescored)        8.7       9.8       9.4       9.2       9.0
# WER on test(orig)           9.4      11.0      10.5      10.4      9.9
# WER on test(rescored)       9.0      10.3       9.9       9.6       9.2
# Final train prob        -0.0893   -0.0861   -0.0840   -0.0836   -0.0848
# Final valid prob        -0.1068   -0.1216   -0.1128   -0.1131   -0.1118
# Final train prob (xent)   -1.4083   -1.5612   -1.4019   -1.4064   -1.4200
# Final valid prob (xent)   -1.4601   -1.6863   -1.4901   -1.5103   -1.5116

These are all with lm-scale = 0.0 (so they are not affected with how we normalize the lattices)
Also the WERs of decoding the unsupervised set is as follows:
beam8.0: %WER 7.50 [ 123226 / 1642860, 16571 ins, 41186 del, 65469 sub ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned_unsup_rescore/wer_11_0.0
beam2.0: %WER 7.89 [ 129656 / 1642860, 17612 ins, 41667 del, 70377 sub ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned_unsup_latbeam2.0_rescore/wer_11_0.0

I am currently running fisher_english, and I'll send the results as soon as they're ready.

danpovey · 2017-06-03T03:00:49Z

Interesting. Try lattice-beam of 1.0 and 0.1 as well. Of course it's a little disappointing if the best results are with the smallest lattice-beam, as it means the technique is no different from regular unsupervised training. Incidentally, I am reconsidering whether to remove the deriv-weight stuff, as if we want to do frame-dependent confidence weighting we might want to use that mechanism.

…

On Fri, Jun 2, 2017 at 10:16 PM, Hossein Hadian ***@***.***> wrote: Results of trying different lattice beams (still on tedlium): # System baseline(200h) sup(50hr) beam8.0 beam4.0 beam2.0 # WER on dev(orig) 9.3 10.6 10.2 9.8 9.7 # WER on dev(rescored) 8.7 9.8 9.4 9.2 9.0 # WER on test(orig) 9.4 11.0 10.5 10.4 9.9 # WER on test(rescored) 9.0 10.3 9.9 9.6 9.2 # Final train prob -0.0893 -0.0861 -0.0840 -0.0836 -0.0848 # Final valid prob -0.1068 -0.1216 -0.1128 -0.1131 -0.1118 # Final train prob (xent) -1.4083 -1.5612 -1.4019 -1.4064 -1.4200 # Final valid prob (xent) -1.4601 -1.6863 -1.4901 -1.5103 -1.5116 These are all with lm-scale = 0.0 (so they are not affected with how we normalize the lattices) Also the WERs of decoding the unsupervised set is as follows: *beam8.0*: %WER 7.50 [ 123226 / 1642860, 16571 ins, 41186 del, 65469 sub ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_ cleaned_unsup_rescore/wer_11_0.0 *beam2.0*: %WER 7.89 [ 129656 / 1642860, 17612 ins, 41667 del, 70377 sub ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_ cleaned_unsup_latbeam2.0_rescore/wer_11_0.0 I am currently running fisher_english, and I'll send the results as soon as they're ready. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu9vz7RZsr732ctl3_sOyO64sbVdiks5sAMIHgaJpZM4NprIG> .

danpovey · 2017-06-03T03:03:15Z

.. something to be careful with is the language model used for decoding. You should probably be excluding the unsupervised portion of the training data from that LM. If the training data is *all* the LM is trained on, the bias from this could be substantial; if it's mixed with other data, the bias might be less. This might affect which techniques work well.

…

On Fri, Jun 2, 2017 at 11:00 PM, Daniel Povey ***@***.***> wrote: Interesting. Try lattice-beam of 1.0 and 0.1 as well. Of course it's a little disappointing if the best results are with the smallest lattice-beam, as it means the technique is no different from regular unsupervised training. Incidentally, I am reconsidering whether to remove the deriv-weight stuff, as if we want to do frame-dependent confidence weighting we might want to use that mechanism. On Fri, Jun 2, 2017 at 10:16 PM, Hossein Hadian ***@***.***> wrote: > Results of trying different lattice beams (still on tedlium): > > # System baseline(200h) sup(50hr) beam8.0 beam4.0 beam2.0 > # WER on dev(orig) 9.3 10.6 10.2 9.8 9.7 > # WER on dev(rescored) 8.7 9.8 9.4 9.2 9.0 > # WER on test(orig) 9.4 11.0 10.5 10.4 9.9 > # WER on test(rescored) 9.0 10.3 9.9 9.6 9.2 > # Final train prob -0.0893 -0.0861 -0.0840 -0.0836 -0.0848 > # Final valid prob -0.1068 -0.1216 -0.1128 -0.1131 -0.1118 > # Final train prob (xent) -1.4083 -1.5612 -1.4019 -1.4064 -1.4200 > # Final valid prob (xent) -1.4601 -1.6863 -1.4901 -1.5103 -1.5116 > > These are all with lm-scale = 0.0 (so they are not affected with how we > normalize the lattices) > Also the WERs of decoding the unsupervised set is as follows: > *beam8.0*: %WER 7.50 [ 123226 / 1642860, 16571 ins, 41186 del, 65469 sub > ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned > _unsup_rescore/wer_11_0.0 > *beam2.0*: %WER 7.89 [ 129656 / 1642860, 17612 ins, 41667 del, 70377 sub > ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned > _unsup_latbeam2.0_rescore/wer_11_0.0 > > I am currently running fisher_english, and I'll send the results as soon > as they're ready. > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#1657 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ADJVu9vz7RZsr732ctl3_sOyO64sbVdiks5sAMIHgaJpZM4NprIG> > . >

hhadian · 2017-06-03T03:18:10Z

I will look into language model. Another effective option is left/right-tolerance I guess; which I have not tuned and is set to 2 in all these experiments. Also the fact that there is speaker overlap might be affecting.

hhadian · 2017-06-05T00:11:44Z

Results of trying different lattice beams (cont'd):

# System                    beam0.1   beam1.0   beam2.0   beam4.0   beam8.0
# WER on dev(orig)            9.9       9.8       9.7       9.8      10.2
# WER on dev(rescored)        9.0       9.1       9.0       9.2       9.4
# WER on test(orig)          10.1      10.1       9.9      10.4      10.5
# WER on test(rescored)       9.6       9.5       9.2       9.6       9.9
# Final train prob        -0.0839   -0.0853   -0.0848   -0.0836   -0.0840
# Final valid prob        -0.1122   -0.1111   -0.1118   -0.1131   -0.1128
# Final train prob (xent)   -1.4249   -1.4217   -1.4200   -1.4064   -1.4019
# Final valid prob (xent)   -1.5211   -1.5153   -1.5116   -1.5103   -1.4901

Beam=2.0 has outperformed other beams.
Also regarding LM sources, I checked and it seems to me that only a very small part of the source is the training data text and the rest is from other sources.

danpovey · 2017-06-05T01:52:33Z

Interesting. It would be nice if you could try beams 2 and 4 in the setup where the chunks are larger for the unsupervised portion of the data. This might let us know whether the reason wider lattices aren't working well, is that we are not handling the edge effects in chunk-splitting very carefully.

…

On Sun, Jun 4, 2017 at 8:11 PM, Hossein Hadian ***@***.***> wrote: Results of trying different lattice beams (cont'd): # System beam0.1 beam1.0 beam2.0 beam4.0 beam8.0 # WER on dev(orig) 9.9 9.8 9.7 9.8 10.2 # WER on dev(rescored) 9.0 9.1 9.0 9.2 9.4 # WER on test(orig) 10.1 10.1 9.9 10.4 10.5 # WER on test(rescored) 9.6 9.5 9.2 9.6 9.9 # Final train prob -0.0839 -0.0853 -0.0848 -0.0836 -0.0840 # Final valid prob -0.1122 -0.1111 -0.1118 -0.1131 -0.1128 # Final train prob (xent) -1.4249 -1.4217 -1.4200 -1.4064 -1.4019 # Final valid prob (xent) -1.5211 -1.5153 -1.5116 -1.5103 -1.4901 Beam=2.0 has outperformed other beams. Also regarding LM sources, I checked and it seems to me that only a very small part of the source is the training data text and the rest is from other sources. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVuwEmipKvw3Wpgcmsw1WMI2MwR-BBks5sA0fDgaJpZM4NprIG> .

hhadian · 2017-06-05T02:18:39Z

Will do.
Also I tried lm-scale=1.0 once again after the normalization fix:

# System                lmsc1.0_fix1 lmsc1.0
# WER on dev(orig)           10.9      11.0
# WER on dev(rescored)       10.2      10.3
# WER on test(orig)          11.1      11.4
# WER on test(rescored)      10.5      10.5
# Final train prob        -0.0860   -0.0857
# Final valid prob        -0.1120   -0.1111
# Final train prob (xent)   -1.4273   -1.4325
# Final valid prob (xent)   -1.5188   -1.5283

There was some improvement but still lm-scale=0.0 is better. Here is the results of trying tolerance = 1 and 2. (Tolerance 0 leads to an assertion failure: left_tol+right_tol >= frame_subsampling_factor)

# System                tol1(beam4.0) tol2(beam4.0)
# WER on dev(orig)           10.0       9.8
# WER on dev(rescored)        9.4       9.2
# WER on test(orig)          10.1      10.4
# WER on test(rescored)       9.5       9.6
# Final train prob        -0.0830   -0.0836
# Final valid prob        -0.1105   -0.1131
# Final train prob (xent)   -1.3549   -1.4064
# Final valid prob (xent)   -1.4419   -1.5103

I might need to try it with other beams to get conclusive results. I guess it's better to try them on fisher_english as the setup is almost ready now.

…vised Travis was failing to compile(not sure why)-- I used the "Update Branch" button

hhadian · 2017-06-20T18:09:53Z

The results on fisher_english (supervised: 100hr, unsupervised: 250hr):
Effect of pruning (lm-scale=0.0, tolerance=1):

# System              baseline350h    sup100hr    prun1    prun2   prun4
# WER on dev                 17.74     20.03     19.61     19.45     19.84
# WER on test                17.57     20.20     19.38     19.25     19.53
# Final train prob          -0.1128   -0.1064   -0.1119   -0.1096   -0.1094
# Final valid prob          -0.1251   -0.1501   -0.1505   -0.1500   -0.1527
# Final train prob (xent)   -1.7908   -1.7571   -1.5977   -1.5903   -1.5843
# Final valid prob (xent)   -1.7712   -1.9253   -1.7578   -1.7430   -1.7435

Effect of pruning when we exclude the unsupervised data text from LM:

# System                  base350hr sup100hr exLM_prun0.1 exLM_prun1 exLM_prun2 exLM_prun4
# WER on dev                 17.74     20.03     19.93     19.71     19.55     19.96
# WER on test                17.57     20.20     19.55     19.62     19.43     19.69
# Final train prob          -0.1128   -0.1064   -0.1117   -0.1105   -0.1112   -0.1088
# Final valid prob          -0.1251   -0.1501   -0.1512   -0.1519   -0.1511   -0.1529
# Final train prob (xent)   -1.7908   -1.7571   -1.6015   -1.6052   -1.5912   -1.5785
# Final valid prob (xent)   -1.7712   -1.9253   -1.7635   -1.7613   -1.7379   -1.7350

Comparison of decoding WER of the unsupervised data with the full LM vs the LM excluding unsuervised text:

full-LM: %WER 17.94 [ 489687 / 2730165, 50624 ins, 164552 del, 274511 sub ] exp/chain_semi350k/tdnn_xxsup1a_sp/decode_train_unsup250k/wer_10
LM excluding unsup text: %WER 20.58 [ 561823 / 2730165, 58475 ins, 175688 del, 327660 sub ] exp/chain_semi350k/tdnn_xxsup1a_sp/decode_train_unsup250k_exLM/wer_10

hhadian · 2017-06-23T21:29:12Z

Results of trying smaller egs weight for the unsupervised data:

# System                base350hr sup100hr egs_wt1.0 egs_wt0.75 egs_wt0.5
# WER on dev               17.74     20.03     19.55     19.70     19.57
# WER on test               17.57     20.20     19.43     19.55     19.57
# Final train prob          -0.1128   -0.1064   -0.1112   -0.1105   -0.1104
# Final valid prob          -0.1251   -0.1501   -0.1511   -0.1514   -0.1506
# Final train prob (xent)   -1.7908   -1.7571   -1.5912   -1.5961   -1.5940
# Final valid prob (xent)   -1.7712   -1.9253   -1.7379   -1.7428   -1.7475

Briefly put, no gain from changing unsupervised egs weight.

Results of testing frames-per-eg=300 (i.e. 2x longer sequences) for unsupervised egs.

# System                base350hr sup100hr sup100hr_mb64 prun1_fpe300 prun2_fpe300 prun4_fpe300
# WER on dev               17.74     20.03     20.23     19.25     19.12     19.34
# WER on test               17.57     20.20     19.90     19.32     19.13     19.26
# Final train prob          -0.1128   -0.1064   -0.1017   -0.1028   -0.1015   -0.1019
# Final valid prob          -0.1251   -0.1501   -0.1466   -0.1422   -0.1445   -0.1431
# Final train prob (xent)   -1.7908   -1.7571   -1.7074   -1.5801   -1.5775   -1.5911
# Final valid prob (xent)   -1.7712   -1.9253   -1.8770   -1.7479   -1.7592   -1.7388

Briefly put, frames_per_eg=300 has obviously improved the results (by 0.4% absolute) and still prune-beam 2 is better. Currently, best recovery rate is:
Test: 44%
Dev: 33%
[sup100hr_mb64 is the same as sup100hr but with minibatch-size=64 instead of 128 to make it comparable to fpe300 experiments which use minibatch-size 64]

danpovey · 2017-06-23T21:36:28Z

You can use different minibatch-size for different lengths of egs for the experiment with long egs, e.g.:

…

--minibatch-size='150=128,64:300=64,32'

On Fri, Jun 23, 2017 at 5:29 PM, Hossein Hadian ***@***.***> wrote: Results of trying smaller egs weight for the unsupervised data: # System base350hr sup100hr egs_wt1.0 egs_wt0.75 egs_wt0.5 # WER on dev 17.74 20.03 19.55 19.70 19.57 # WER on test 17.57 20.20 19.43 19.55 19.57 # Final train prob -0.1128 -0.1064 -0.1112 -0.1105 -0.1104 # Final valid prob -0.1251 -0.1501 -0.1511 -0.1514 -0.1506 # Final train prob (xent) -1.7908 -1.7571 -1.5912 -1.5961 -1.5940 # Final valid prob (xent) -1.7712 -1.9253 -1.7379 -1.7428 -1.7475 Briefly put, no gain from changing unsupervised egs weight. Results of testing frames-per-eg=300 (i.e. 2x longer sequences) for unsupervised egs. # System base350hr sup100hr sup100hr_mb64 prun1_fpe300 prun2_fpe300 prun4_fpe300 # WER on dev 17.74 20.03 20.23 19.25 19.12 19.34 # WER on test 17.57 20.20 19.90 19.32 19.13 19.26 # Final train prob -0.1128 -0.1064 -0.1017 -0.1028 -0.1015 -0.1019 # Final valid prob -0.1251 -0.1501 -0.1466 -0.1422 -0.1445 -0.1431 # Final train prob (xent) -1.7908 -1.7571 -1.7074 -1.5801 -1.5775 -1.5911 # Final valid prob (xent) -1.7712 -1.9253 -1.8770 -1.7479 -1.7592 -1.7388 Briefly put, frames_per_eg=300 has obviously improved the results and still prune-beam 2 is better. Currently, best recovery rate is: Test: 44% Dev: 33% [sup100hr_mb64 is the same as sup100hr but with minibatch-size=64 instead of 128 to make it comparable to fpe300 experiments which use minibatch-size 64] — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu7Q8rcoxHQn1DJeGcaxsly9vM5aIks5sHC4ugaJpZM4NprIG> .

hhadian · 2017-06-24T06:12:31Z

I will use it.
Results for tolerances 0, 1, and 2 (lm-scale=0):

# System                base350hr sup100hr prun2_tol0 prun2_tol1 prun2_tol2
# WER on dev               17.74     20.03     20.57     19.55     19.72
# WER on test               17.57     20.20     20.18     19.43     19.56
# Final train prob          -0.1128   -0.1064   -0.1151   -0.1112   -0.1112
# Final valid prob          -0.1251   -0.1501   -0.1558   -0.1511   -0.1507
# Final train prob (xent)   -1.7908   -1.7571   -1.5276   -1.5912   -1.6506
# Final valid prob (xent)   -1.7712   -1.9253   -1.6889   -1.7379   -1.8018

danpovey · 2017-06-24T19:54:18Z

Interesting. Possibly tolerance=0 doesn't work well because, even though the lattices were generated with the chain system and the alignments in the lattices should be good, we shift the acoustics left and right by -1 and 1 frame (really -1/3 and +1/3 of a frame, at the output frame rate) during training.

…

On Sat, Jun 24, 2017 at 2:12 AM, Hossein Hadian ***@***.***> wrote: I will use it. Results for tolerances 0, 1, and 2 (lm-scale=0): # System base350hr sup100hr prun2_tol0 prun2_tol1 prun2_tol2 # WER on dev 17.74 20.03 20.57 19.55 19.72 # WER on test 17.57 20.20 20.18 19.43 19.56 # Final train prob -0.1128 -0.1064 -0.1151 -0.1112 -0.1112 # Final valid prob -0.1251 -0.1501 -0.1558 -0.1511 -0.1507 # Final train prob (xent) -1.7908 -1.7571 -1.5276 -1.5912 -1.6506 # Final valid prob (xent) -1.7712 -1.9253 -1.6889 -1.7379 -1.8018 — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu26AP-DMkcBKJTFdPuQnlebCRFKlks5sHKjSgaJpZM4NprIG> .

hhadian · 2017-06-27T19:15:23Z

Results of trying phone insertion penalties -0.5 and -1:

# System                  base350hr  sup100hr   pip0   pip-0.5   pip-1
# WER on dev                17.74     20.03     19.55     19.50     19.40
# WER on test               17.57     20.20     19.43     19.19     19.23
# Final train prob          -0.1128   -0.1064   -0.1112   -0.1121   -0.1116
# Final valid prob          -0.1251   -0.1501   -0.1511   -0.1504   -0.1521
# Final train prob (xent)   -1.7908   -1.7571   -1.5912   -1.5899   -1.5976
# Final valid prob (xent)   -1.7712   -1.9253   -1.7379   -1.7424   -1.7533

There is a bit improvement (.1 on dev and .3 on test), so I combined it with frames-per-eg=300 which was helpful but no extra gain was achieved. So the avg recovery rate is still ~40%.
I guess we should now try these with a smaller supervised set like 30hr or even less.

vimalmanohar · 2017-07-12T19:52:01Z

Using confidences gives around 0.2-0.3% improvement, but its noisy and improvements are different on dev and test.

# System                         b               conf_a_wgt0.5   conf_b_wgt0.3  
# WER on dev                     19.24           19.09           19.03          
# WER on test                    18.91           18.67           18.86          
# Final train prob               -0.1131         -0.1111         -0.1053        
# Final valid prob               -0.1539         -0.1526         -0.1477        
# Final train prob (xent)        -64.8031        -1.6046         -1.5923        
# Final valid prob (xent)        -70.3492        -1.7545         -1.7555

danpovey · 2017-07-12T20:12:34Z

Interesting. If we saw improvements on another dataset I might be inclined to believe it. Also I'd be interested to know whether using the raw, uncalibrated confidences might be sufficient. I'd rather not have a dependency on the confidence-calibration script. Look at Karel's publications on unsupervised training and see what he did, I think he has worked with this type of thing.

…

On Wed, Jul 12, 2017 at 3:52 PM, Vimal Manohar ***@***.***> wrote: Using confidences gives around 0.2-0.3% improvement, but its noisy and improvements are different on dev and test. System b conf_a_wgt0.5 conf_b_wgt0.3 WER on dev 19.24 19.09 19.03 WER on test 18.91 18.67 18.86 $ Final train prob -0.1131 -0.1111 -0.1053 $ Final valid prob -0.1539 -0.1526 -0.1477 $ Final train prob (xent) -64.8031 -1.6046 -1.5923 $ Final valid prob (xent) -70.3492 -1.7545 -1.7555 — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu0TdeqMU1sO4SGugMCbIQi1RiBfnks5sNSPlgaJpZM4NprIG> .

vimalmanohar · 2017-07-12T20:29:24Z

I didn't find anything in Karel's publications. It's perhaps not published yet. On Wed, Jul 12, 2017 at 4:12 PM Daniel Povey <[email protected]> wrote:

Interesting. If we saw improvements on another dataset I might be inclined to believe it. Also I'd be interested to know whether using the raw, uncalibrated confidences might be sufficient. I'd rather not have a dependency on the confidence-calibration script. Look at Karel's publications on unsupervised training and see what he did, I think he has worked with this type of thing. On Wed, Jul 12, 2017 at 3:52 PM, Vimal Manohar ***@***.***> wrote: > Using confidences gives around 0.2-0.3% improvement, but its noisy and > improvements are different on dev and test. > System b conf_a_wgt0.5 conf_b_wgt0.3 WER on dev 19.24 19.09 19.03 WER on > test 18.91 18.67 18.86 > > $ Final train prob -0.1131 -0.1111 -0.1053 > $ Final valid prob -0.1539 -0.1526 -0.1477 > $ Final train prob (xent) -64.8031 -1.6046 -1.5923 > $ Final valid prob (xent) -70.3492 -1.7545 -1.7555 > > — > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <#1657 (comment)>, or mute > the thread > < https://github.com/notifications/unsubscribe-auth/ADJVu0TdeqMU1sO4SGugMCbIQi1RiBfnks5sNSPlgaJpZM4NprIG > > . > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEATVwRksGFRNBNOYjdDMpqScqYUBQy7ks5sNSi6gaJpZM4NprIG> .

-- Vimal Manohar PhD Student Electrical & Computer Engineering Johns Hopkins University

danpovey · 2017-07-12T20:34:14Z

Look here http://www.fit.vutbr.cz/research/groups/speech/publi/2013/vesely_asru2013_0000267.pdf As far as I can tell, he is using the posteriors from the MBR code, without any calibration. On Wed, Jul 12, 2017 at 4:29 PM, Vimal Manohar <[email protected]> wrote:

…

I didn't find anything in Karel's publications. It's perhaps not published yet. On Wed, Jul 12, 2017 at 4:12 PM Daniel Povey ***@***.***> wrote: > Interesting. If we saw improvements on another dataset I might be inclined > to believe it. > Also I'd be interested to know whether using the raw, uncalibrated > confidences might be sufficient. I'd rather not have a dependency on the > confidence-calibration script. Look at Karel's publications on > unsupervised training and see what he did, I think he has worked with this > type of thing. > > > On Wed, Jul 12, 2017 at 3:52 PM, Vimal Manohar ***@***.*** > > wrote: > > > Using confidences gives around 0.2-0.3% improvement, but its noisy and > > improvements are different on dev and test. > > System b conf_a_wgt0.5 conf_b_wgt0.3 WER on dev 19.24 19.09 19.03 WER on > > test 18.91 18.67 18.86 > > > > $ Final train prob -0.1131 -0.1111 -0.1053 > > $ Final valid prob -0.1539 -0.1526 -0.1477 > > $ Final train prob (xent) -64.8031 -1.6046 -1.5923 > > $ Final valid prob (xent) -70.3492 -1.7545 -1.7555 > > > > — > > You are receiving this because you commented. > > Reply to this email directly, view it on GitHub > > <#1657 (comment)>, > or mute > > the thread > > < > https://github.com/notifications/unsubscribe-auth/ ADJVu0TdeqMU1sO4SGugMCbIQi1RiBfnks5sNSPlgaJpZM4NprIG > > > > . > > > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#1657 (comment)>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/ AEATVwRksGFRNBNOYjdDMpqScqYUBQy7ks5sNSi6gaJpZM4NprIG> > . > -- Vimal Manohar PhD Student Electrical & Computer Engineering Johns Hopkins University — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu4IBCZKCwqleYrvFT1fRiENAyroYks5sNSypgaJpZM4NprIG> .

vimalmanohar · 2017-07-12T20:39:47Z

Sure, but this is an old paper. It does not compare the MBR posteriors or lattice-posteriors vs the calibrated confidences.

danpovey · 2017-07-12T20:47:19Z

Sure. But anything that increases the complexity of the system, I need to see evidence that the improvement it gives is in proportional with the complexity.

…

On Wed, Jul 12, 2017 at 4:39 PM, Vimal Manohar ***@***.***> wrote: Sure, but this is an old paper. It does not compare the MBR posteriors or lattice-posteriors vs the calibrated confidences. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVuy3M-jKGgMkAp5_ALqLtsgIHQOQ6ks5sNS8WgaJpZM4NprIG> .

hhadian · 2017-07-12T21:46:52Z

Results for a small supervised set (20k utterances ~16hr) on fisher_english:
Tolerance is 1, lm-weight is 0 and frames-per-eg is 300.

# System                fullset270k   sup20k   semi_prune1  semi_prune2 semi_prune4
# WER on dev                18.07     27.44     24.42     23.82     24.29
# WER on test               18.35     26.54     23.83     23.65     23.65
# Final train prob          -0.1129   -0.1233   -0.0998   -0.0864   -0.1047
# Final valid prob          -0.1515   -0.1606   -0.1443   -0.1326   -0.1437
# Final train prob (xent)   -1.7665   -1.9930   -1.5169   -1.4797   -1.4869
# Final valid prob (xent)   -1.9411   -2.0966   -1.6458   -1.6352   -1.6162

Avg recovery rate is ~33%. One difference in this case is that lattice-beam 4 is better than lattice-beam 1 (both are still worse than lattice-beam 2). In the 100hr supervised set, lattice-beam 1 was similar or better than 4.

vimalmanohar · 2017-07-16T21:01:01Z

Using uncalibrated confidences gives similar performance to calibrated confidence and both are better than not using deriv weights (which is the baseline). Also training phone LM using both unsupervised data best path and supervision alignments makes very little difference.

# System                         baseline             calibrated_conf      uncalibrated_conf    phone_lm_weights
# WER on dev                     19.24                19.09                18.94                18.67
# WER on test                    18.91                18.67                18.72                18.91
# Final train prob               -0.1131              -0.1111              -0.1105              -0.1134
# Final valid prob               -0.1539              -0.1526              -0.1528              -0.1549
# Final train prob (xent)        -64.8031             -1.6046              -1.5941              -1.5997
# Final valid prob (xent)        -70.3492             -1.7545              -1.7445              -1.7569

danpovey · 2017-07-16T21:17:08Z

OK, cool. So let's prefer the simpler recipe. by "training LM..", do you mean the LM used for decoding, or the phone LM used for chain training?

…

On Sun, Jul 16, 2017 at 5:01 PM, Vimal Manohar ***@***.***> wrote: Using uncalibrated confidences gives similar performance to calibrated confidence and both are better than not using deriv weights. Also training LM using both unsupervised data best path and supervision alignments makes very little difference. # System baseline calibrated_conf uncalibrated_conf phone_lm_weights # WER on dev 19.24 19.09 18.94 18.67 # WER on test 18.91 18.67 18.72 18.91 # Final train prob -0.1131 -0.1111 -0.1105 -0.1134 # Final valid prob -0.1539 -0.1526 -0.1528 -0.1549 # Final train prob (xent) -64.8031 -1.6046 -1.5941 -1.5997 # Final valid prob (xent) -70.3492 -1.7545 -1.7445 -1.7569 — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu-d4s-0OGVVOk1NtWYr-JJROMzfxks5sOnoPgaJpZM4NprIG> .

vimalmanohar · 2017-07-16T21:23:24Z

OK, cool. So let's prefer the simpler recipe. by "training LM..", do you mean the LM used for decoding, or the phone LM used for chain training?

Phone LM for chain training.

hhadian · 2017-07-16T21:42:19Z

I tried final model combination step with mixed supervised+unsupervised egs (previously it was using only supervised egs) and the results did not change (test got better by 0.1% and dev got worse by 0.1%). [I only tried this on the best case where prune=2, lmwt=0, tol=1, unsup_frames_per_eg=300]
So I guess the improvement in Vimal's results (0.2-0.4) is all due to multi-task training. I think we should try it on the 16hr supervised set too. I am not sure how multi-task training is doing better than training on combined&shuffled supervised+unsuoervised egs.

danpovey · 2017-07-16T21:45:16Z

If it turns out that it's not necessary to use the multi-task setup, I'd be happier to check in something that doesn't use it. There are other ways to weight the datasets, e.g. do it while generating egs. It isn't quite as flexible but there are less ways for it to go wrong. Dan

…

On Sun, Jul 16, 2017 at 5:42 PM, Hossein Hadian ***@***.***> wrote: I tried final model combination step with mixed supervised+unsupervised egs (previously it was using only supervised egs) and the results did not change (test got better by 0.1% and dev got worse by 0.1%). [I only tried this on the best case where prune=2, lmwt=0, tol=1, unsup_frames_per_eg=300] So I guess the improvement in Vimal's results (0.2-0.4) is all due to multi-task training. I think we should try it on the 16hr supervised set too. I am not sure how multi-task training is doing better than training on combined&shuffled supervised+unsuoervised egs. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#1657 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADJVu1X8WPYuEbncBN0Un9jglydbNUn2ks5sOoO9gaJpZM4NprIG> .

hhadian · 2017-07-16T22:10:36Z

Actually, I guess it is not about weighting since Vimal's best results (and my best results) are with weight 1.0 for both unsupervised and supervised egs. Maybe it's the order in which we see the egs while training.
Hossein

KarelVesely84 · 2017-07-24T12:42:51Z

Hi, I'll have a new paper about semi-supervised training at INTERSPEECH'17.
https://drive.google.com/open?id=0B5FTXafjWqpIWlh6cUl2S1c0Ums

In my experiments I saw that that the confidence calibration usually did not lead to better SST results. The calibration is handy for presenting the recognition output. But, the 'ideally calibrated confidences', are usually not the same as the 'best confidences for SST'.

nshmyrev · 2017-07-24T12:51:47Z

Confidence for semi-supervised selection is overall not a good idea, there are publications like http://www.cs.cmu.edu/%7Erongz/icassp_2006.pdf

KarelVesely84 · 2017-07-24T13:34:26Z

Thanks for the paper. Well, i would say, it depends... :) It is true that the most confident data are not those which help the most in SST : https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/SemiSupervised-Interspeech2013_pub.pdf (Figure 3). While a 'careful' data selection can improve the results, compared to 'all-data-in' training... (my new paper https://drive.google.com/open?id=0B5FTXafjWqpIWlh6cUl2S1c0Ums). And yes, 'not-so-careful' data selection cause a WER degradation...

hhadian · 2018-04-26T20:31:43Z

Closing this PR as the other PR #2140 by Vimal has been merged.

hhadian added 2 commits May 29, 2017 02:27

[WIP] Add chain semi-supervised script + src changes

55cc6f9

Minor fixes

b91276b

Add more options to run_semisupervised.sh

ba26120

Add a check in supervision code

477bdf3

danpovey reviewed May 31, 2017

View reviewed changes

Some fixes + new options

f55f686

Merge branch 'master' into semi_supervised

c6ffb15

hhadian added 2 commits June 6, 2017 12:38

Add nnet3, chain, and semi_sepervised scripts for fisher english

403e3e2

Merge remote-tracking branch 'origin/semi_supervised' into semi_super…

0c8974e

…vised Travis was failing to compile(not sure why)-- I used the "Update Branch" button

hhadian added 2 commits June 29, 2017 15:01

Add phone-insertion-penalty + minor updates

55d3321

Some updates to run_tdnn.sh + fix the issue with combine-egs

86fd4b2

minor fix

9082b33

hhadian closed this Apr 26, 2018

[WIP] Semi-supervised training using chain models #1657

[WIP] Semi-supervised training using chain models #1657

Uh oh!

Conversation

hhadian commented May 29, 2017

Uh oh!

danpovey commented May 29, 2017 via email

Uh oh!

hhadian commented May 29, 2017

Uh oh!

danpovey commented May 29, 2017 via email

Uh oh!

hhadian commented May 30, 2017

Uh oh!

hhadian commented May 30, 2017

Uh oh!

danpovey commented May 30, 2017 via email

Uh oh!

vimalmanohar commented May 30, 2017 via email

Uh oh!

danpovey commented May 30, 2017 via email

Uh oh!

hhadian commented May 31, 2017

Uh oh!

jtrmal commented May 31, 2017 via email

Uh oh!

hhadian commented May 31, 2017

Uh oh!

danpovey commented May 31, 2017 via email

Uh oh!

danpovey May 31, 2017

Choose a reason for hiding this comment

Uh oh!

hhadian Jun 1, 2017

Choose a reason for hiding this comment

Uh oh!

danpovey May 31, 2017

Choose a reason for hiding this comment

Uh oh!

danpovey May 31, 2017

Choose a reason for hiding this comment

Uh oh!

hhadian Jun 1, 2017

Choose a reason for hiding this comment

Uh oh!

danpovey commented Jun 1, 2017 via email

Uh oh!

hhadian commented Jun 3, 2017

Uh oh!

danpovey commented Jun 3, 2017 via email

Uh oh!

danpovey commented Jun 3, 2017 via email

Uh oh!

hhadian commented Jun 3, 2017

Uh oh!

hhadian commented Jun 5, 2017

Uh oh!

danpovey commented Jun 5, 2017 via email

Uh oh!

hhadian commented Jun 5, 2017

Uh oh!

hhadian commented Jun 20, 2017

Uh oh!

hhadian commented Jun 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danpovey commented Jun 23, 2017 via email

Uh oh!

hhadian commented Jun 24, 2017

Uh oh!

danpovey commented Jun 24, 2017 via email

Uh oh!

hhadian commented Jun 27, 2017

Uh oh!

vimalmanohar commented Jul 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

danpovey commented Jul 12, 2017 via email

Uh oh!

vimalmanohar commented Jul 12, 2017 via email

hhadian commented Jun 23, 2017 •

edited

Loading

vimalmanohar commented Jul 12, 2017 •

edited

Loading

vimalmanohar commented Jul 16, 2017 •

edited

Loading

KarelVesely84 commented Jul 24, 2017 •

edited

Loading