[WIP] Semi-supervised training using chain models #1657
Conversation
|
Cool.
I'd rather have @vimalmanohar review this for now than get into it myself.
Something I think will be critical is the splitting of the egs. When Vimal
split lattices for discriminative training, he paid special attention to the
initial/final probs where they were split. I think what is going on here is
much simpler, as the acoustic scores are discarded long before the lattices
get split. It might lead to a situation where a too-deep lattice would
actually give you a degradation, because you have too many low-likelihood
paths active near the edges of the chunks.
Anyway, the degradation from this should decrease as the egs from the
unsupervised part get longer, so if you try doubling the length of the egs
from the unsupervised part, it will give us a sense of how important this
issue is.
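To make the chunk-edge argument concrete, here is a back-of-the-envelope sketch (the chunk lengths and margin below are illustrative numbers, not values from these experiments): the fraction of frames sitting near a chunk boundary, where edge effects can act, halves when the chunk length doubles.

```cpp
#include <cassert>

// Fraction of frames that lie within `margin` frames of a chunk boundary
// when an utterance is cut into chunks of `chunk_len` frames. This is the
// region where, with acoustic scores discarded, low-likelihood lattice
// paths could stay active near the edges of the chunks.
double EdgeFraction(int chunk_len, int margin) {
  int edge_frames = 2 * margin;  // `margin` frames at each end of a chunk
  if (edge_frames > chunk_len) edge_frames = chunk_len;
  return static_cast<double>(edge_frames) / chunk_len;
}
```

With an (assumed) 15-frame margin, 150-frame chunks put 20% of frames in the edge region while 300-frame chunks put only 10% there, which is why longer unsupervised egs should reduce the degradation.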
…On Mon, May 29, 2017 at 6:30 PM, Hossein Hadian wrote:
I have tried to write a rather generic script
local/chain/run_semisupervised.sh which can easily be used for other setups
and configs. Please let me know if anything needs to be fixed/improved. I'm
working on it and will push more changes whenever ready.
Current initial results with left/right-tolerance=2 and lattice-lm-scale=0.1
(nothing is tuned):
# System                    baseline(200h)  sup(50h)  combined1  combined2
# WER on dev(orig)          9.3             10.6      10.3       10.3
# WER on dev(rescored)      8.7             9.8       9.6        9.6
# WER on test(orig)         9.4             11.0      10.6       10.5
# WER on test(rescored)     9.0             10.3      10.0       9.8
# Final train prob          -0.0893         -0.0861   -0.0882    -0.0890
# Final valid prob          -0.1068         -0.1216   -0.1124    -0.1142
# Final train prob (xent)   -1.4083         -1.5612   -1.4318    -1.4369
# Final valid prob (xent)   -1.4601         -1.6863   -1.5244    -1.5223
combined2 means the "decoding of unsupervised data and combining and
retraining" has been performed a second time (using the combined1 model).
Recovery of degradation is from ~25% to ~38%.
Note: supervised training is done almost entirely with the supervised data
(i.e. 50h, even the ivector extractor and ivectors), except for the initial
fmllr alignment of the supervised data, which I did using the tri3 GMM model
trained on the whole dataset (I imagined it wouldn't make a difference).
Also, I did not use Pegah's egs-combination scripts, because I realized a
simple nnet3-chain-copy-egs command could do it.
@vimalmanohar @pegahgh
Commit Summary
- [WIP] Add chain semi-supervised script + src changes
- Minor fixes
File Changes
- A egs/tedlium/s5_r2/local/chain/run_semisupervised.sh (174)
- M egs/tedlium/s5_r2/local/chain/tuning/run_tdnn_1d.sh (6)
- M egs/wsj/s5/steps/nnet3/chain/get_egs.sh (4)
- M src/chain/chain-supervision.cc (54)
- M src/chain/chain-supervision.h (7)
|
|
A potential problem with increasing the length of egs for the unsupervised data is that they will be placed in different minibatches, so we won't have minibatches with mixed supervised/unsupervised egs. Might this affect the randomization in SGD in a bad way? |
|
I don't think it's a problem.
|
|
New results (testing frames_per_eg=300 and more epochs):
|
|
One important thing that I forgot to mention yesterday: currently the way I split into supervised/unsupervised is based on utterances, so there is speaker overlap. If there shouldn't be any speaker overlap, I'll need to change it. |
|
I don't think that matters.
|
|
I think what people normally do is not have speaker overlap (or recording
overlap, in the case of tedlium).
--
Vimal Manohar
PhD Student
Electrical & Computer Engineering
Johns Hopkins University
|
|
Maybe it's best to do what Vimal says if it's the common practice, or it
could cause problems with some reviewers. But it's not time-sensitive.
|
|
OK, will do. Results of testing different lm-scales:
# System                    lmsc0     lmsc0.1   lmsc0.3   lmsc1.0
# WER on dev(orig)          10.2      10.2      10.2      11.0
# WER on dev(rescored)      9.4       9.5       9.6       10.3
# WER on test(orig)         10.5      10.4      10.6      11.4
# WER on test(rescored)     9.9       9.8       10.0      10.5
# Final train prob          -0.0840   -0.0832   -0.0843   -0.0857
# Final valid prob          -0.1128   -0.1123   -0.1134   -0.1111
# Final train prob (xent)   -1.4019   -1.3885   -1.3976   -1.4325
# Final valid prob (xent)   -1.4901   -1.4917   -1.5097   -1.5283
|
|
I'm not sure I understand what the numbers mean. Is this the scale applied
to the lattice when generating the training examples?
|
|
Yes, the scale is applied to the graph/LM weight of the unsupervised decoded lattices while generating the chain supervision from them (the acoustic weights are ignored). |
|
That's interesting, but unexpected; I would have expected a scale of 1.0 to
be optimal.
|
    fst::StdArc(phone, phone,
                fst::TropicalWeight(lat_arc.weight.Weight().Value1()
                                    * opts.lm_scale),
                lat_arc.nextstate));
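A minimal standalone sketch of what the arc construction above does to the weights: the acoustic cost is discarded and only the graph/LM cost, scaled by lm_scale, goes onto the supervision arc. LatWeight and SupervisionArcCost below are simplified stand-ins for illustration, not the actual Kaldi types.

```cpp
#include <cassert>
#include <cmath>

// Simplified stand-in for a lattice arc weight: a graph (LM) cost and an
// acoustic cost, mirroring Value1()/Value2() of Kaldi's LatticeWeight.
struct LatWeight {
  double graph_cost;     // Value1()
  double acoustic_cost;  // Value2()
};

// Cost placed on the phone-FST arc when building supervision from a decoded
// lattice: the acoustic cost is intentionally dropped and the graph cost is
// scaled by lm_scale (0.0 discards the lattice scores entirely).
double SupervisionArcCost(const LatWeight &w, double lm_scale) {
  return lm_scale * w.graph_cost;  // acoustic_cost deliberately unused
}
```

With lm_scale=0.0 every path in the lattice gets equal weight, which is why the lm-scale=0 experiments are unaffected by how the lattices are normalized.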
Don't forget to include the weights when you process the final-probs.
Will do.
src/chain/chain-supervision.cc
CompactLattice::Arc::StateId start = clat->Start();  // Should be 0
BaseFloat total_backward_cost = beta[start];
This variable is not a cost, it is a negated cost (or a log-likelihood), so the name isn't suitable. Fix it where you got the code from as well.
for (fst::StateIterator<CompactLattice> sit(*clat); !sit.Done(); sit.Next()) {
  CompactLatticeWeight f = clat->Final(sit.Value());
  LatticeWeight w = f.Weight();
  w.SetValue1(w.Value1() + total_backward_cost);
Actually, I don't believe this code is quite right for this purpose. The total_backward_cost can be interpreted as the negative of the value1() + value2() of the total weight, and you are just adding it to the value1(). It's the value1() which we care about being normalized, but it won't end up normalized if you process it like this; it will have something with the magnitude of the value2() added to it. It would be OK if you were to zero out all the value2()'s first. But I think it would be easier to do the normalization in a different way, after converting it to an FST: compute the best-path cost using ShortestDistance, and then subtract that from all the final-probs.
This may require a slight refactoring of the code.
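A toy sketch of the suggested alternative on a small acyclic FST. The types and function names here are made up for illustration (real code would call fst::ShortestDistance on the supervision FST), but the idea is the same: find the best-path cost, then subtract it from every final cost so the best complete path has total cost zero.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Toy tropical-semiring FST: states 0..n-1 in topological order, state 0 is
// the start. final_cost[s] = +inf marks a non-final state.
struct Arc { int src, dst; double cost; };

const double kInf = std::numeric_limits<double>::infinity();

// Min cost from the start state through some final state, i.e. what
// fst::ShortestDistance would give in the tropical semiring. Arcs are
// assumed sorted by source state (valid because states are topological).
double BestPathCost(const std::vector<Arc> &arcs,
                    const std::vector<double> &final_cost) {
  std::vector<double> d(final_cost.size(), kInf);
  d[0] = 0.0;
  for (const Arc &a : arcs)
    if (d[a.src] + a.cost < d[a.dst]) d[a.dst] = d[a.src] + a.cost;
  double best = kInf;
  for (size_t s = 0; s < final_cost.size(); ++s)
    if (d[s] + final_cost[s] < best) best = d[s] + final_cost[s];
  return best;
}

// The normalization suggested above: subtract the best-path cost from every
// final cost, so the best complete path ends up with total cost 0.
void NormalizeFinals(const std::vector<Arc> &arcs,
                     std::vector<double> *final_cost) {
  double best = BestPathCost(arcs, *final_cost);
  for (double &f : *final_cost)
    if (f != kInf) f -= best;
}
```

Unlike adding the backward cost to value1() only, this touches a single quantity (the final cost), so nothing with the magnitude of value2() leaks into the normalization.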
So do you mean I should convert it to an FST, do the normalization, and then convert it back to a lattice (all before calling PhoneLatticeToProtoSupervisionInternal), or alternatively do it on the FST of the resulting ProtoSupervision (i.e. after calling PhoneLatticeToProtoSupervisionInternal)?
|
Do it on the fst of the resulting ProtoSupervision. Of course, no need to
do it if the lm_scale is zero.
|
|
Results of trying different lattice beams (still on tedlium):
# System                    baseline(200h)  sup(50hr)  beam8.0   beam4.0   beam2.0
# WER on dev(orig)          9.3             10.6       10.2      9.8       9.7
# WER on dev(rescored)      8.7             9.8        9.4       9.2       9.0
# WER on test(orig)         9.4             11.0       10.5      10.4      9.9
# WER on test(rescored)     9.0             10.3       9.9       9.6       9.2
# Final train prob          -0.0893         -0.0861    -0.0840   -0.0836   -0.0848
# Final valid prob          -0.1068         -0.1216    -0.1128   -0.1131   -0.1118
# Final train prob (xent)   -1.4083         -1.5612    -1.4019   -1.4064   -1.4200
# Final valid prob (xent)   -1.4601         -1.6863    -1.4901   -1.5103   -1.5116
These are all with lm-scale = 0.0 (so they are not affected by how we
normalize the lattices).
Also, the WERs of decoding the unsupervised set are as follows:
beam8.0: %WER 7.50 [ 123226 / 1642860, 16571 ins, 41186 del, 65469 sub ]
exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned_unsup_rescore/wer_11_0.0
beam2.0: %WER 7.89 [ 129656 / 1642860, 17612 ins, 41667 del, 70377 sub ]
exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned_unsup_latbeam2.0_rescore/wer_11_0.0
I am currently running fisher_english, and I'll send the results as soon as
they're ready. |
|
Interesting. Try lattice-beams of 1.0 and 0.1 as well.
Of course it's a little disappointing if the best results are with the
smallest lattice-beam, as it would mean the technique is no different from
regular unsupervised training.
Incidentally, I am reconsidering whether to remove the deriv-weight stuff:
if we want to do frame-dependent confidence weighting, we might want to use
that mechanism.
|
|
Something to be careful with is the language model used for decoding. You
should probably be excluding the unsupervised portion of the training data
from that LM. If the training data is *all* the LM is trained on, the bias
from this could be substantial; if it's mixed with other data, the bias
might be less.
This might affect which techniques work well.
|
|
I will look into the language model. Another effective option is left/right-tolerance, I guess, which I have not tuned; it is set to 2 in all these experiments. Also, the fact that there is speaker overlap might have an effect. |
|
Results of trying different lattice beams (cont'd):
# System                    beam0.1   beam1.0   beam2.0   beam4.0   beam8.0
# WER on dev(orig)          9.9       9.8       9.7       9.8       10.2
# WER on dev(rescored)      9.0       9.1       9.0       9.2       9.4
# WER on test(orig)         10.1      10.1      9.9       10.4      10.5
# WER on test(rescored)     9.6       9.5       9.2       9.6       9.9
# Final train prob          -0.0839   -0.0853   -0.0848   -0.0836   -0.0840
# Final valid prob          -0.1122   -0.1111   -0.1118   -0.1131   -0.1128
# Final train prob (xent)   -1.4249   -1.4217   -1.4200   -1.4064   -1.4019
# Final valid prob (xent)   -1.5211   -1.5153   -1.5116   -1.5103   -1.4901
Beam=2.0 has outperformed the other beams.
Also, regarding LM sources: I checked, and it seems that only a very small
part of the LM source is the training-data text; the rest is from other
sources. |
|
Interesting. It would be nice if you could try beams 2 and 4 in the setup
where the chunks are larger for the unsupervised portion of the data. This
might tell us whether the reason wider lattices aren't working well is that
we are not handling the edge effects in chunk-splitting very carefully.
|
|
Will do. There was some improvement, but lm-scale=0.0 is still better. Here are the results of trying tolerance = 1 and 2. (Tolerance 0 leads to an assertion failure: left_tol + right_tol >= frame_subsampling_factor.) I might need to try other beams to get conclusive results. I guess it's better to try them on fisher_english, as that setup is almost ready now. |
…vised: Travis was failing to compile (not sure why), so I used the "Update Branch" button.
|
The results on fisher_english (supervised: 100hr, unsupervised: 250hr): Effect of pruning when we exclude the unsupervised-data text from the LM: Comparison of the decoding WER of the unsupervised data with the full LM vs. the LM excluding unsupervised text: |
|
Results of trying a smaller egs weight for the unsupervised data:
# System                    base350hr  sup100hr  egs_wt1.0  egs_wt0.75  egs_wt0.5
# WER on dev                17.74      20.03     19.55      19.70       19.57
# WER on test               17.57      20.20     19.43      19.55       19.57
# Final train prob          -0.1128    -0.1064   -0.1112    -0.1105     -0.1104
# Final valid prob          -0.1251    -0.1501   -0.1511    -0.1514     -0.1506
# Final train prob (xent)   -1.7908    -1.7571   -1.5912    -1.5961     -1.5940
# Final valid prob (xent)   -1.7712    -1.9253   -1.7379    -1.7428     -1.7475
Briefly put, no gain from changing the unsupervised egs weight.
Results of testing frames-per-eg=300 (i.e. 2x longer sequences) for
unsupervised egs:
# System                    base350hr  sup100hr  sup100hr_mb64  prun1_fpe300  prun2_fpe300  prun4_fpe300
# WER on dev                17.74      20.03     20.23          19.25         19.12         19.34
# WER on test               17.57      20.20     19.90          19.32         19.13         19.26
# Final train prob          -0.1128    -0.1064   -0.1017        -0.1028       -0.1015       -0.1019
# Final valid prob          -0.1251    -0.1501   -0.1466        -0.1422       -0.1445       -0.1431
# Final train prob (xent)   -1.7908    -1.7571   -1.7074        -1.5801       -1.5775       -1.5911
# Final valid prob (xent)   -1.7712    -1.9253   -1.8770        -1.7479       -1.7592       -1.7388
Briefly put, frames_per_eg=300 has clearly improved the results (by 0.4%
absolute), and prune-beam 2 is still better. Currently, the best recovery
rates are:
Test: 44%
Dev: 33%
[sup100hr_mb64 is the same as sup100hr but with minibatch-size=64 instead of
128, to make it comparable to the fpe300 experiments, which use
minibatch-size 64.]
|
|
You can use a different minibatch-size for different lengths of egs in the
experiment with long egs, e.g.:
--minibatch-size='150=128,64:300=64,32'
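For reference, a simplified sketch of how a spec string like the one above could be interpreted: a map from eg length (frames per eg) to the list of allowed minibatch sizes. This is an illustrative re-implementation, not Kaldi's actual option parser (whose exact syntax may support more than shown here).

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parse a spec like "150=128,64:300=64,32" into {eg_len -> minibatch sizes}.
// Groups are ':'-separated; within a group, the eg length precedes '=' and
// the comma-separated sizes follow it. Assumes well-formed input.
std::map<int, std::vector<int>> ParseMinibatchSpec(const std::string &spec) {
  std::map<int, std::vector<int>> result;
  std::istringstream groups(spec);
  std::string group;
  while (std::getline(groups, group, ':')) {  // e.g. "150=128,64"
    size_t eq = group.find('=');
    int eg_len = std::stoi(group.substr(0, eq));
    std::istringstream sizes(group.substr(eq + 1));
    std::string size;
    while (std::getline(sizes, size, ','))
      result[eg_len].push_back(std::stoi(size));
  }
  return result;
}
```

The point of the option is that 150-frame and 300-frame egs can never share a minibatch anyway, so each length gets its own (smaller, for longer egs) minibatch sizes.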
|
|
I will use it.
Results for tolerances 0, 1, and 2 (lm-scale=0):
# System                    base350hr  sup100hr  prun2_tol0  prun2_tol1  prun2_tol2
# WER on dev                17.74      20.03     20.57       19.55       19.72
# WER on test               17.57      20.20     20.18       19.43       19.56
# Final train prob          -0.1128    -0.1064   -0.1151     -0.1112     -0.1112
# Final valid prob          -0.1251    -0.1501   -0.1558     -0.1511     -0.1507
# Final train prob (xent)   -1.7908    -1.7571   -1.5276     -1.5912     -1.6506
# Final valid prob (xent)   -1.7712    -1.9253   -1.6889     -1.7379     -1.8018
|
|
Interesting. Possibly tolerance=0 doesn't work well because, even though the
lattices were generated with the chain system and the alignments in the
lattices should be good, we shift the acoustics left and right by -1 and +1
frames (really -1/3 and +1/3 of a frame, at the output frame rate) during
training.
|
|
Results of trying phone insertion penalties -0.5 and -1: There is a slight improvement (0.1 on dev and 0.3 on test), so I combined it with frames-per-eg=300, which had been helpful, but no extra gain was achieved. So the average recovery rate is still ~40%. |
|
Using confidences gives around a 0.2-0.3% improvement, but it's noisy and the improvements differ between dev and test.
# System                    b         conf_a_wgt0.5  conf_b_wgt0.3
# WER on dev                19.24     19.09          19.03
# WER on test               18.91     18.67          18.86
# Final train prob          -0.1131   -0.1111        -0.1053
# Final valid prob          -0.1539   -0.1526        -0.1477
# Final train prob (xent)   -64.8031  -1.6046        -1.5923
# Final valid prob (xent)   -70.3492  -1.7545        -1.7555
|
|
Interesting. If we saw improvements on another dataset I might be inclined
to believe it.
Also, I'd be interested to know whether using the raw, uncalibrated
confidences might be sufficient. I'd rather not have a dependency on the
confidence-calibration script. Look at Karel's publications on unsupervised
training and see what he did; I think he has worked with this type of thing.
|
|
I didn't find anything in Karel's publications. It's perhaps not published
yet.
|
|
Look here:
http://www.fit.vutbr.cz/research/groups/speech/publi/2013/vesely_asru2013_0000267.pdf
As far as I can tell, he is using the posteriors from the MBR code, without
any calibration.
|
|
Sure, but this is an old paper. It does not compare the MBR posteriors or lattice-posteriors vs the calibrated confidences. |
|
Sure. But for anything that increases the complexity of the system, I need
to see evidence that the improvement it gives is proportional to the added
complexity.
|
|
Results for a small supervised set (20k utterances, ~16hr) on fisher_english: The average recovery rate is ~33%. One difference in this case is that lattice-beam 4 is better than lattice-beam 1 (both are still worse than lattice-beam 2). With the 100hr supervised set, lattice-beam 1 was similar to or better than 4. |
|
Using uncalibrated confidences gives similar performance to calibrated confidences, and both are better than not using deriv weights (which is the baseline). Also, training the phone LM using both the unsupervised-data best path and the supervision alignments makes very little difference.
# System                    baseline  calibrated_conf  uncalibrated_conf  phone_lm_weights
# WER on dev                19.24     19.09            18.94              18.67
# WER on test               18.91     18.67            18.72              18.91
# Final train prob          -0.1131   -0.1111          -0.1105            -0.1134
# Final valid prob          -0.1539   -0.1526          -0.1528            -0.1549
# Final train prob (xent)   -64.8031  -1.6046          -1.5941            -1.5997
# Final valid prob (xent)   -70.3492  -1.7545          -1.7445            -1.7569
|
|
OK, cool. So let's prefer the simpler recipe.
By "training the LM...", do you mean the LM used for decoding, or the phone
LM used for chain training?
|
Phone LM for chain training.
|
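The "phone LM trained on both sources" variant above amounts to pooling phone-sequence statistics from the supervised alignments and the unsupervised best paths, possibly with different weights per source. Kaldi's chain setup estimates a higher-order phone LM (via chain-est-phone-lm); the bigram sketch below only illustrates the count pooling, and all names and sequences are hypothetical:

```python
from collections import Counter

def pooled_bigram_counts(sources, weights):
    """Pool phone-bigram counts from several sources of phone sequences.

    `sources` is a list of source corpora (each a list of phone
    sequences, e.g. supervised alignments and unsupervised best
    paths); `weights` scales each source's counts before pooling.
    """
    counts = Counter()
    for seqs, w in zip(sources, weights):
        for seq in seqs:
            # Count bigrams including sentence-boundary symbols.
            for a, b in zip(["<s>"] + seq, seq + ["</s>"]):
                counts[(a, b)] += w
    return counts

sup = [["a", "b", "c"], ["a", "c"]]   # hypothetical supervised alignments
unsup = [["a", "b"], ["b", "c"]]      # hypothetical best paths
c = pooled_bigram_counts([sup, unsup], [1.0, 0.5])
print(c[("a", "b")])  # supervised count 1.0 plus down-weighted unsupervised 0.5
```

Down-weighting the unsupervised source (0.5 here) is one way to limit the influence of errorful best paths on the denominator LM.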
I tried the final model combination step with mixed supervised+unsupervised egs (previously it used only supervised egs) and the results did not change (test got better by 0.1% and dev got worse by 0.1%). [I only tried this on the best case, where prune=2, lmwt=0, tol=1, unsup_frames_per_eg=300.] So I guess the improvement in Vimal's results (0.2-0.4) is all due to multi-task training. I think we should try it on the 16hr supervised set too. I am not sure how multi-task training is doing better than training on combined and shuffled supervised+unsupervised egs.
|
If it turns out that it's not necessary to use the multi-task setup, I'd be happier to check in something that doesn't use it. There are other ways to weight the datasets, e.g. doing it while generating egs. That isn't quite as flexible, but there are fewer ways for it to go wrong.
Dan
|
Actually, I guess it is not about weighting, since Vimal's best results (and my best results) are with weight 1.0 for both unsupervised and supervised egs. Maybe it's the order in which we see the egs during training.
|
Hi, I'll have a new paper about semi-supervised training at INTERSPEECH'17. In my experiments I saw that the confidence calibration usually did not lead to better SST results. Calibration is handy for presenting the recognition output, but the 'ideally calibrated confidences' are usually not the same as the 'best confidences for SST'.
|
Confidence-based selection for semi-supervised training is overall not a good idea; there are publications like http://www.cs.cmu.edu/%7Erongz/icassp_2006.pdf
|
Thanks for the paper. Well, I would say it depends... :) It is true that the most confident data are not those which help the most in SST: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/SemiSupervised-Interspeech2013_pub.pdf (Figure 3). While a 'careful' data selection can improve the results compared to 'all-data-in' training (my new paper: https://drive.google.com/open?id=0B5FTXafjWqpIWlh6cUl2S1c0Ums), a 'not-so-careful' data selection causes a WER degradation...
|
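The selection debate above is about a very simple mechanism: rank or filter unsupervised utterances by an utterance-level confidence before adding them to training. A naive sketch of such a selector (plain Python; utterance names and confidence values are hypothetical, and a 'careful' selector would use more than a plain threshold):

```python
def select_utts(confidences, low=0.7, high=1.0):
    """Naive confidence-based data selection.

    Keeps utterances whose average frame confidence lies in
    [low, high].  Using a band rather than just a floor reflects the
    observation that the *most* confident data are often the least
    informative, since the model already gets them right.
    """
    return [utt for utt, conf in confidences.items() if low <= conf <= high]

confs = {"utt1": 0.95, "utt2": 0.65, "utt3": 0.80}  # hypothetical
print(sorted(select_utts(confs)))
```

In contrast, the lattice-based approach in this PR keeps all the data and lets the per-frame weights and lattice supervision absorb the uncertainty, which is why hard selection is not central here.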
Closing this PR as the other PR #2140 by Vimal has been merged.
I have tried to write a rather generic script local/chain/run_semisupervised.sh which can be easily used for other setups and configs. Please let me know if anything needs to be fixed/improved. I'm working on it and will push more changes whenever ready.
Current initial results with left/right-tolerance=2 and lattice-lm-scale=0.1 (nothing is tuned) are shown in the table above.
combined2 means the "decoding of unsupervised data and combining and retraining" has been performed a second time (using the combined1 model). Recovery of degradation is from ~25% to ~38%.
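The "recovery of degradation" figure can be computed directly from the WERs in the table: the fraction of the gap between the 50h supervised-only system and the 200h oracle that semi-supervised training closes. A small sketch using two of the reported numbers (the formula is my reading of the metric, not something stated in the recipe):

```python
def recovery(wer_sup_only, wer_semisup, wer_oracle):
    """Fraction of the degradation (relative to training on all
    transcribed data) that semi-supervised training recovers."""
    return (wer_sup_only - wer_semisup) / (wer_sup_only - wer_oracle)

# Numbers from the results table in the PR description:
print(round(recovery(10.6, 10.3, 9.3), 3))  # combined1 on dev (orig), roughly the ~25% end
print(round(recovery(10.3, 9.8, 9.0), 3))   # combined2 on test (rescored), the ~38% end
```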
Note: Supervised training is done almost entirely on the supervised data alone (i.e. 50h; even the ivector extractor and ivectors), except for the initial fMLLR alignment of the supervised data, which I did using the tri3 GMM model trained on the whole dataset (I imagined it wouldn't make a difference).
Also, I did not use Pegah's egs-combination scripts, because I realized a simple nnet3-chain-copy-egs command could do it.
@vimalmanohar @pegahgh