
Conversation

@hhadian (Contributor) commented Apr 7, 2018

My last PR (#2315) introduced some issues, which this PR fixes.
The main issue was that I created only one dict (containing the 50k most frequent words). That dict is meant for decoding; if we also use it for training, we get some WER degradation, because around 1% of the training text becomes OOV. The results reported in that PR are still correct, since they came from models trained before the change in the dict, but running from scratch would not replicate them.
So in this PR I create separate langs for training and decoding: the training lang includes all the training words, while the decoding lang includes only the 50k most frequent ones (see the sketch below).
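For reference, a minimal sketch of how the two langs could be prepared. The 50k threshold is from this PR, but the dict/lang directory names below are assumptions for illustration, not the exact paths in the recipe:

```bash
# Hypothetical sketch; directory names are assumptions.
# Build a list of the 50k most frequent words in the training text:
cut -d' ' -f2- data/train/text | tr ' ' '\n' | sort | uniq -c | \
  sort -rn | head -n 50000 | awk '{print $2}' > data/local/words_50k.txt

# Training lang: the lexicon covers all training words, so the
# training text has no OOVs:
utils/prepare_lang.sh data/local/dict_train "<unk>" \
  data/local/lang_train_tmp data/lang_train

# Decoding lang: the lexicon is restricted to the 50k most frequent words:
utils/prepare_lang.sh data/local/dict_decode "<unk>" \
  data/local/lang_decode_tmp data/lang_decode
```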

Another subtle issue is that the variable $lang_test in the chain scripts was used in the wrong places (for example, for lattice generation). This leads to big lattices and WER degradation when lang_unk is used. I had never noticed this because I always reused lattices I had generated a long time ago.
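To make the distinction concrete, here is a sketch with assumed directory names: the lattices used for chain training should be generated with the training lang, and $lang_test should only come in when building the decoding graph.

```bash
# Hypothetical sketch; directory names are assumptions.
# Lattices for chain training come from the training lang, not $lang_test:
steps/align_fmllr_lats.sh --nj 30 --cmd "$train_cmd" \
  data/train data/lang_train exp/tri3 exp/tri3_lats

# The decoding lang (e.g. lang_unk) is used only for the decoding graph:
utils/mkgraph.sh data/lang_test exp/chain/tdnn1a exp/chain/tdnn1a/graph
```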

The rest of this PR:

  • Updates the e2e results (after using the correct lang for training); the regular chain results were already correct.
  • Some cleanup and minor bug fixes.

@danpovey danpovey merged commit c643295 into kaldi-asr:master Apr 7, 2018
LvHang pushed a commit to LvHang/kaldi that referenced this pull request Apr 14, 2018
Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018