
Conversation

@hhadian
Contributor

hhadian commented May 29, 2017

I have tried to write a fairly generic script, local/chain/run_semisupervised.sh, which can easily be reused for other setups and configs. Please let me know if anything needs to be fixed or improved. I'm still working on it and will push more changes as they are ready.
Current initial results with left/right-tolerance=2 and lattice-lm-scale=0.1 (nothing is tuned):

# System                baseline(200h) sup(50h) combined1 combined2
# WER on dev(orig)            9.3      10.6      10.3      10.3
# WER on dev(rescored)        8.7       9.8       9.6       9.6
# WER on test(orig)           9.4      11.0      10.6      10.5
# WER on test(rescored)       9.0      10.3      10.0       9.8
# Final train prob        -0.0893   -0.0861   -0.0882   -0.0890
# Final valid prob        -0.1068   -0.1216   -0.1124   -0.1142
# Final train prob (xent)   -1.4083   -1.5612   -1.4318   -1.4369
# Final valid prob (xent)   -1.4601   -1.6863   -1.5244   -1.5223

combined2 means the "decode the unsupervised data, combine, and retrain" step has been performed a second time (using the combined1 model). Recovery of the degradation ranges from ~25% to ~38%.
Note: supervised training uses almost only the supervised data (i.e. the 50h subset, including the ivector extractor and ivectors); the one exception is the initial fMLLR alignment of the supervised data, which I did using the tri3 GMM model trained on the whole dataset (I assumed it wouldn't make a difference).
Also, I did not use Pegah's egs-combination scripts because I realized a simple nnet3-chain-copy-egs command could do it.
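For reference, a minimal sketch of the kind of command this refers to (the directory names and the single-archive layout below are hypothetical, and the shuffle step is just one way to mix the two kinds of egs):

# Hypothetical paths; in practice this would be repeated per archive index.
sup_egs=exp/chain/tdnn_sup/egs
unsup_egs=exp/chain/tdnn_sup/unsup_egs
comb_egs=exp/chain/tdnn_semisup/egs
mkdir -p $comb_egs

# Concatenate the supervised and unsupervised archives on the fly, copy them
# with nnet3-chain-copy-egs, and shuffle so minibatches mix both kinds of egs.
nnet3-chain-copy-egs \
  "ark:cat $sup_egs/cegs.1.ark $unsup_egs/cegs.1.ark |" ark:- | \
  nnet3-chain-shuffle-egs --srand=0 ark:- "ark:$comb_egs/cegs.1.ark"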

@vimalmanohar @pegahgh

@danpovey
Contributor

danpovey commented May 29, 2017 via email

@hhadian
Contributor Author

hhadian commented May 29, 2017

A potential problem with increasing the length of the egs for the unsupervised data is that they will end up in different minibatches, so we won't have minibatches with mixed supervised/unsupervised egs. Might this hurt the randomization in SGD?

@danpovey
Contributor

danpovey commented May 29, 2017 via email

@hhadian
Contributor Author

hhadian commented May 30, 2017

New results (testing frames_per_eg=300 and more epochs):

# System                baseline   supervised   comb1a   comb1a_fpe300
# WER on dev(orig)            9.3      10.6      10.2      10.1
# WER on dev(rescored)        8.7       9.8       9.5       9.2
# WER on test(orig)           9.4      11.0      10.4      10.4
# WER on test(rescored)       9.0      10.3       9.8       9.9
# Final train prob        -0.0893   -0.0861   -0.0832   -0.0841
# Final valid prob        -0.1068   -0.1216   -0.1123   -0.1137
# Final train prob (xent)   -1.4083   -1.5612   -1.3885   -1.4014
# Final valid prob (xent)   -1.4601   -1.6863   -1.4917   -1.5180

comb1a is the same as combined1 in my previous comment, except with 5 epochs instead of 4.
In these results, the recovery ranges from 35% to 54%.

@hhadian
Contributor Author

hhadian commented May 30, 2017

One important thing I forgot to mention yesterday: currently the supervised/unsupervised split is done at the utterance level, so there is speaker overlap. If there shouldn't be any speaker overlap, I'll need to change it.

@danpovey
Contributor

danpovey commented May 30, 2017 via email

@vimalmanohar
Contributor

vimalmanohar commented May 30, 2017 via email

@danpovey
Contributor

danpovey commented May 30, 2017 via email

@hhadian
Contributor Author

hhadian commented May 31, 2017

OK, will do. Results of testing different lm-scales:

# System                    lmsc0    lmsc0.1    lmsc0.3    lmsc1.0
# WER on dev(orig)           10.2      10.2      10.2      11.0
# WER on dev(rescored)        9.4       9.5       9.6      10.3
# WER on test(orig)          10.5      10.4      10.6      11.4
# WER on test(rescored)       9.9       9.8      10.0      10.5
# Final train prob        -0.0840   -0.0832   -0.0843   -0.0857
# Final valid prob        -0.1128   -0.1123   -0.1134   -0.1111
# Final train prob (xent)   -1.4019   -1.3885   -1.3976   -1.4325
# Final valid prob (xent)   -1.4901   -1.4917   -1.5097   -1.5283

@jtrmal
Contributor

jtrmal commented May 31, 2017 via email

@hhadian
Contributor Author

hhadian commented May 31, 2017

Yes, the scale is applied to the graph/LM weight of the decoded lattices of the unsupervised data when generating the chain supervision from them (the acoustic weights are ignored).

@danpovey
Contributor

danpovey commented May 31, 2017 via email

fst::StdArc(phone, phone,
            fst::TropicalWeight(lat_arc.weight.Weight().Value1()
                                * opts.lm_scale),
            lat_arc.nextstate));

Contributor

Don't forget to include the weights when you process the final-probs.

Contributor Author

Will do it

  }

  CompactLattice::Arc::StateId start = clat->Start();  // Should be 0
  BaseFloat total_backward_cost = beta[start];

Contributor

This variable is not a cost; it is a negated cost (i.e. a loglike), so the name isn't suitable. Fix it where you got the code from, as well.

  for (fst::StateIterator<CompactLattice> sit(*clat); !sit.Done(); sit.Next()) {
    CompactLatticeWeight f = clat->Final(sit.Value());
    LatticeWeight w = f.Weight();
    w.SetValue1(w.Value1() + total_backward_cost);

Contributor

Actually, I don't believe this code is quite right for this purpose. The total_backward_cost can be interpreted as the negative of the value1() + value2() of the total weight, and you are just adding it to the value1(). It's the value1() that we care about being normalized, but it won't end up normalized if you process it like this; it will have something with the magnitude of the value2() added to it. It would be OK if you were to zero out all the value2()'s first. But I think it would be easier to do the normalization in a different way: after converting it to an FST, you could compute the best-path cost using ShortestDistance, and then subtract that from all the final-probs.
This may require a slight refactoring of the code.
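For concreteness, a minimal sketch of the normalization being suggested here, assuming the lattice has already been converted to a tropical-semiring fst::StdVectorFst (the function name is made up, not the PR's code): compute the best-path cost with fst::ShortestDistance and subtract it from every final weight, so the best path ends up with cost zero.

#include <vector>
#include <fst/fstlib.h>

// Sketch only (not the PR's implementation): shift all final weights so that
// the cheapest complete path through the FST has cost zero.
void NormalizeBestPathCost(fst::StdVectorFst *fst) {
  if (fst->Start() == fst::kNoStateId) return;  // empty FST, nothing to do.
  // With reverse = true, distance[s] is the cost of the cheapest path from
  // state s to a final state (final weight included), in the tropical semiring.
  std::vector<fst::TropicalWeight> distance;
  fst::ShortestDistance(*fst, &distance, true);
  float best_cost = distance[fst->Start()].Value();
  for (fst::StateIterator<fst::StdVectorFst> siter(*fst); !siter.Done();
       siter.Next()) {
    fst::StdArc::StateId s = siter.Value();
    fst::TropicalWeight final_weight = fst->Final(s);
    if (final_weight != fst::TropicalWeight::Zero())
      fst->SetFinal(s, fst::TropicalWeight(final_weight.Value() - best_cost));
  }
}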

Contributor Author

So do you mean I should convert it to an FST, do the normalization, and then convert it back to a lattice (all before calling PhoneLatticeToProtoSupervisionInternal), or alternatively do it on the FST of the resulting ProtoSupervision (i.e. after calling PhoneLatticeToProtoSupervisionInternal)?

@danpovey
Contributor

danpovey commented Jun 1, 2017 via email

@hhadian
Contributor Author

hhadian commented Jun 3, 2017

Results of trying different lattice beams (still on tedlium):

# System              baseline(200h) sup(50hr)  beam8.0    beam4.0  beam2.0
# WER on dev(orig)            9.3      10.6      10.2      9.8      9.7
# WER on dev(rescored)        8.7       9.8       9.4       9.2       9.0
# WER on test(orig)           9.4      11.0      10.5      10.4      9.9
# WER on test(rescored)       9.0      10.3       9.9       9.6       9.2
# Final train prob        -0.0893   -0.0861   -0.0840   -0.0836   -0.0848
# Final valid prob        -0.1068   -0.1216   -0.1128   -0.1131   -0.1118
# Final train prob (xent)   -1.4083   -1.5612   -1.4019   -1.4064   -1.4200
# Final valid prob (xent)   -1.4601   -1.6863   -1.4901   -1.5103   -1.5116

These are all with lm-scale = 0.0 (so they are not affected by how we normalize the lattices).
Also, the WERs from decoding the unsupervised set are as follows:
beam8.0: %WER 7.50 [ 123226 / 1642860, 16571 ins, 41186 del, 65469 sub ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned_unsup_rescore/wer_11_0.0
beam2.0: %WER 7.89 [ 129656 / 1642860, 17612 ins, 41667 del, 70377 sub ] exp/chain_cleaned_semi/tdnn_sup1d_sp_bi/decode_train_cleaned_unsup_latbeam2.0_rescore/wer_11_0.0

I am currently running fisher_english, and I'll send the results as soon as they're ready.

@danpovey
Contributor

danpovey commented Jun 3, 2017 via email

@danpovey
Contributor

danpovey commented Jun 3, 2017 via email

@hhadian
Contributor Author

hhadian commented Jun 3, 2017

I will look into the language model. Another potentially effective option is the left/right-tolerance, I guess, which I have not tuned and which is set to 2 in all these experiments. Also, the fact that there is speaker overlap might be affecting the results.

@hhadian
Contributor Author

hhadian commented Jun 5, 2017

Results of trying different lattice beams (cont'd):

# System                    beam0.1   beam1.0   beam2.0   beam4.0   beam8.0
# WER on dev(orig)            9.9       9.8       9.7       9.8      10.2
# WER on dev(rescored)        9.0       9.1       9.0       9.2       9.4
# WER on test(orig)          10.1      10.1       9.9      10.4      10.5
# WER on test(rescored)       9.6       9.5       9.2       9.6       9.9
# Final train prob        -0.0839   -0.0853   -0.0848   -0.0836   -0.0840
# Final valid prob        -0.1122   -0.1111   -0.1118   -0.1131   -0.1128
# Final train prob (xent)   -1.4249   -1.4217   -1.4200   -1.4064   -1.4019
# Final valid prob (xent)   -1.5211   -1.5153   -1.5116   -1.5103   -1.4901

Beam=2.0 has outperformed the other beams.
Also, regarding the LM sources: I checked, and it seems that only a very small part of the LM source text is the training data transcripts; the rest is from other sources.

@danpovey
Contributor

danpovey commented Jun 5, 2017 via email

@hhadian
Contributor Author

hhadian commented Jun 5, 2017

Will do.
Also, I tried lm-scale=1.0 again after the normalization fix:

# System                lmsc1.0_fix1 lmsc1.0
# WER on dev(orig)           10.9      11.0
# WER on dev(rescored)       10.2      10.3
# WER on test(orig)          11.1      11.4
# WER on test(rescored)      10.5      10.5
# Final train prob        -0.0860   -0.0857
# Final valid prob        -0.1120   -0.1111
# Final train prob (xent)   -1.4273   -1.4325
# Final valid prob (xent)   -1.5188   -1.5283

There was some improvement, but lm-scale=0.0 is still better. Here are the results of trying tolerance = 1 and 2 (tolerance 0 leads to an assertion failure: left_tol + right_tol >= frame_subsampling_factor):

# System                tol1(beam4.0) tol2(beam4.0)
# WER on dev(orig)           10.0       9.8
# WER on dev(rescored)        9.4       9.2
# WER on test(orig)          10.1      10.4
# WER on test(rescored)       9.5       9.6
# Final train prob        -0.0830   -0.0836
# Final valid prob        -0.1105   -0.1131
# Final train prob (xent)   -1.3549   -1.4064
# Final valid prob (xent)   -1.4419   -1.5103

I might need to try other beams to get conclusive results. I guess it's better to try them on fisher_english, as that setup is almost ready now.

@hhadian
Contributor Author

hhadian commented Jun 20, 2017

The results on fisher_english (supervised: 100hr, unsupervised: 250hr):
Effect of pruning (lm-scale=0.0, tolerance=1):

# System              baseline350h    sup100hr    prun1    prun2   prun4
# WER on dev                 17.74     20.03     19.61     19.45     19.84
# WER on test                17.57     20.20     19.38     19.25     19.53
# Final train prob          -0.1128   -0.1064   -0.1119   -0.1096   -0.1094
# Final valid prob          -0.1251   -0.1501   -0.1505   -0.1500   -0.1527
# Final train prob (xent)   -1.7908   -1.7571   -1.5977   -1.5903   -1.5843
# Final valid prob (xent)   -1.7712   -1.9253   -1.7578   -1.7430   -1.7435

Effect of pruning when we exclude the unsupervised data text from the LM:

# System                  base350hr sup100hr exLM_prun0.1 exLM_prun1 exLM_prun2 exLM_prun4
# WER on dev                 17.74     20.03     19.93     19.71     19.55     19.96
# WER on test                17.57     20.20     19.55     19.62     19.43     19.69
# Final train prob          -0.1128   -0.1064   -0.1117   -0.1105   -0.1112   -0.1088
# Final valid prob          -0.1251   -0.1501   -0.1512   -0.1519   -0.1511   -0.1529
# Final train prob (xent)   -1.7908   -1.7571   -1.6015   -1.6052   -1.5912   -1.5785
# Final valid prob (xent)   -1.7712   -1.9253   -1.7635   -1.7613   -1.7379   -1.7350

Comparison of the decoding WER on the unsupervised data with the full LM vs. the LM excluding the unsupervised text:

full-LM: %WER 17.94 [ 489687 / 2730165, 50624 ins, 164552 del, 274511 sub ] exp/chain_semi350k/tdnn_xxsup1a_sp/decode_train_unsup250k/wer_10
LM excluding unsup text: %WER 20.58 [ 561823 / 2730165, 58475 ins, 175688 del, 327660 sub ] exp/chain_semi350k/tdnn_xxsup1a_sp/decode_train_unsup250k_exLM/wer_10

@hhadian
Contributor Author

hhadian commented Jun 23, 2017

Results of trying a smaller egs weight for the unsupervised data:

# System                base350hr sup100hr egs_wt1.0 egs_wt0.75 egs_wt0.5
# WER on dev               17.74     20.03     19.55     19.70     19.57
# WER on test               17.57     20.20     19.43     19.55     19.57
# Final train prob          -0.1128   -0.1064   -0.1112   -0.1105   -0.1104
# Final valid prob          -0.1251   -0.1501   -0.1511   -0.1514   -0.1506
# Final train prob (xent)   -1.7908   -1.7571   -1.5912   -1.5961   -1.5940
# Final valid prob (xent)   -1.7712   -1.9253   -1.7379   -1.7428   -1.7475

Briefly put, there is no gain from changing the unsupervised egs weight.

Results of testing frames-per-eg=300 (i.e. 2x longer sequences) for the unsupervised egs:

# System                base350hr sup100hr sup100hr_mb64 prun1_fpe300 prun2_fpe300 prun4_fpe300
# WER on dev               17.74     20.03     20.23     19.25     19.12     19.34
# WER on test               17.57     20.20     19.90     19.32     19.13     19.26
# Final train prob          -0.1128   -0.1064   -0.1017   -0.1028   -0.1015   -0.1019
# Final valid prob          -0.1251   -0.1501   -0.1466   -0.1422   -0.1445   -0.1431
# Final train prob (xent)   -1.7908   -1.7571   -1.7074   -1.5801   -1.5775   -1.5911
# Final valid prob (xent)   -1.7712   -1.9253   -1.8770   -1.7479   -1.7592   -1.7388

Briefly put, frames_per_eg=300 has clearly improved the results (by about 0.4% absolute), and prune-beam 2 is still the best. Currently, the best recovery rates are:
Test: 44%
Dev: 33%
[sup100hr_mb64 is the same as sup100hr but with minibatch-size=64 instead of 128, to make it comparable to the fpe300 experiments, which use minibatch-size 64.]

@danpovey
Contributor

danpovey commented Jun 23, 2017 via email

@hhadian
Contributor Author

hhadian commented Jun 24, 2017

I will use it.
Results for tolerances 0, 1, and 2 (lm-scale=0):

# System                base350hr sup100hr prun2_tol0 prun2_tol1 prun2_tol2
# WER on dev               17.74     20.03     20.57     19.55     19.72
# WER on test               17.57     20.20     20.18     19.43     19.56
# Final train prob          -0.1128   -0.1064   -0.1151   -0.1112   -0.1112
# Final valid prob          -0.1251   -0.1501   -0.1558   -0.1511   -0.1507
# Final train prob (xent)   -1.7908   -1.7571   -1.5276   -1.5912   -1.6506
# Final valid prob (xent)   -1.7712   -1.9253   -1.6889   -1.7379   -1.8018

@danpovey
Contributor

danpovey commented Jun 24, 2017 via email

@hhadian
Contributor Author

hhadian commented Jun 27, 2017

Results of trying phone insertion penalties -0.5 and -1:

# System                  base350hr  sup100hr   pip0   pip-0.5   pip-1
# WER on dev                17.74     20.03     19.55     19.50     19.40
# WER on test               17.57     20.20     19.43     19.19     19.23
# Final train prob          -0.1128   -0.1064   -0.1112   -0.1121   -0.1116
# Final valid prob          -0.1251   -0.1501   -0.1511   -0.1504   -0.1521
# Final train prob (xent)   -1.7908   -1.7571   -1.5912   -1.5899   -1.5976
# Final valid prob (xent)   -1.7712   -1.9253   -1.7379   -1.7424   -1.7533

There is a bit of improvement (0.1 on dev and 0.3 on test), so I combined it with frames-per-eg=300, which was helpful on its own, but no extra gain was achieved. So the average recovery rate is still ~40%.
I guess we should now try these with a smaller supervised set, like 30hr or even less.

@vimalmanohar
Contributor

vimalmanohar commented Jul 12, 2017

Using confidences gives around a 0.2-0.3% improvement, but it's noisy and the improvements differ between dev and test.

# System                         b               conf_a_wgt0.5   conf_b_wgt0.3  
# WER on dev                     19.24           19.09           19.03          
# WER on test                    18.91           18.67           18.86          
# Final train prob               -0.1131         -0.1111         -0.1053        
# Final valid prob               -0.1539         -0.1526         -0.1477        
# Final train prob (xent)        -64.8031        -1.6046         -1.5923        
# Final valid prob (xent)        -70.3492        -1.7545         -1.7555

@danpovey
Contributor

danpovey commented Jul 12, 2017 via email

@vimalmanohar
Contributor

vimalmanohar commented Jul 12, 2017 via email

@danpovey
Contributor

danpovey commented Jul 12, 2017 via email

@vimalmanohar
Contributor

Sure, but this is an old paper. It does not compare the MBR posteriors or lattice-posteriors vs the calibrated confidences.

@danpovey
Contributor

danpovey commented Jul 12, 2017 via email

@hhadian
Contributor Author

hhadian commented Jul 12, 2017

Results for a small supervised set (20k utterances, ~16hr) on fisher_english:
Tolerance is 1, lm-weight is 0, and frames-per-eg is 300.

# System                fullset270k   sup20k   semi_prune1  semi_prune2 semi_prune4
# WER on dev                18.07     27.44     24.42     23.82     24.29
# WER on test               18.35     26.54     23.83     23.65     23.65
# Final train prob          -0.1129   -0.1233   -0.0998   -0.0864   -0.1047
# Final valid prob          -0.1515   -0.1606   -0.1443   -0.1326   -0.1437
# Final train prob (xent)   -1.7665   -1.9930   -1.5169   -1.4797   -1.4869
# Final valid prob (xent)   -1.9411   -2.0966   -1.6458   -1.6352   -1.6162

The average recovery rate is ~33%. One difference in this case is that lattice-beam 4 is better than lattice-beam 1 (both are still worse than lattice-beam 2); with the 100hr supervised set, lattice-beam 1 was similar to or better than 4.

@vimalmanohar
Contributor

vimalmanohar commented Jul 16, 2017

Using uncalibrated confidences gives similar performance to calibrated confidences, and both are better than not using deriv weights (which is the baseline). Also, training the phone LM on both the unsupervised-data best paths and the supervision alignments makes very little difference.

# System                         baseline             calibrated_conf      uncalibrated_conf    phone_lm_weights
# WER on dev                     19.24                19.09                18.94                18.67
# WER on test                    18.91                18.67                18.72                18.91
# Final train prob               -0.1131              -0.1111              -0.1105              -0.1134
# Final valid prob               -0.1539              -0.1526              -0.1528              -0.1549
# Final train prob (xent)        -64.8031             -1.6046              -1.5941              -1.5997
# Final valid prob (xent)        -70.3492             -1.7545              -1.7445              -1.7569

@danpovey
Contributor

danpovey commented Jul 16, 2017 via email

@vimalmanohar
Contributor

OK, cool. So let's prefer the simpler recipe. By "training LM..", do you mean the LM used for decoding, or the phone LM used for chain training?

Phone LM for chain training.

@hhadian
Contributor Author

hhadian commented Jul 16, 2017

I tried the final model-combination step with mixed supervised+unsupervised egs (previously it used only supervised egs) and the results did not change (test got better by 0.1% and dev got worse by 0.1%). [I only tried this on the best case, with prune=2, lmwt=0, tol=1, unsup_frames_per_eg=300.]
So I guess the improvement in Vimal's results (0.2-0.4) is all due to multi-task training. I think we should try it on the 16hr supervised set too. I am not sure why multi-task training does better than training on combined and shuffled supervised+unsupervised egs.

@danpovey
Contributor

danpovey commented Jul 16, 2017 via email

@hhadian
Contributor Author

hhadian commented Jul 16, 2017

Actually, I guess it is not about weighting, since Vimal's best results (and mine) are with weight 1.0 for both unsupervised and supervised egs. Maybe it's the order in which we see the egs during training.
Hossein

@KarelVesely84
Contributor

KarelVesely84 commented Jul 24, 2017

Hi, I'll have a new paper about semi-supervised training at INTERSPEECH'17.
https://drive.google.com/open?id=0B5FTXafjWqpIWlh6cUl2S1c0Ums

In my experiments I saw that the confidence calibration usually did not lead to better SST results. The calibration is handy for presenting the recognition output, but the 'ideally calibrated confidences' are usually not the same as the 'best confidences for SST'.

@nshmyrev
Contributor

Using confidences for semi-supervised data selection is overall not a good idea; there are publications on this, e.g. http://www.cs.cmu.edu/%7Erongz/icassp_2006.pdf

@KarelVesely84
Contributor

Thanks for the paper. Well, I would say it depends... :) It is true that the most confident data are not those which help the most in SST: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/SemiSupervised-Interspeech2013_pub.pdf (Figure 3). Still, a 'careful' data selection can improve the results compared to 'all-data-in' training (my new paper: https://drive.google.com/open?id=0B5FTXafjWqpIWlh6cUl2S1c0Ums). And yes, a 'not-so-careful' data selection causes a WER degradation...

@hhadian
Contributor Author

hhadian commented Apr 26, 2018

Closing this PR as the other PR #2140 by Vimal has been merged.

hhadian closed this Apr 26, 2018