Some fixes in IAM scripts re using the right lang for training #2340
My last PR (#2315) introduced some issues which I have fixed in this PR.
The main issue was that I created only one dict (which includes the 50k most frequent words). This dict is meant for decoding; if we use it for training we get some WER degradation, because around 1% of the training text ends up OOV. The results in that PR are still correct because I used previously trained models (from before the change in the dict), but if we run from scratch we can't replicate them. So in this PR I create separate langs for training and decoding: the training lang includes all the training words, while the decoding lang includes only the 50k most frequent ones.
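For illustration, a minimal sketch of building the two langs from separate dicts (the dict/lang directory names and the OOV symbol here are assumptions, not necessarily the ones used in the recipe):

```bash
# Training lang: dict covers all words in the training text, so there are no OOVs during training.
utils/prepare_lang.sh data/local/dict_train "<unk>" data/local/lang_train_tmp data/lang

# Decoding lang: dict restricted to the 50k most frequent words; used only for the decoding graph.
utils/prepare_lang.sh data/local/dict_decode "<unk>" data/local/lang_decode_tmp data/lang_test
```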
Another subtle issue is that the variable $lang_test in the chain scripts was used in the wrong places (for example, for lattice generation). This leads to big lattices and WER degradation if lang_unk is used. I never noticed this because I always used the same lattices I had generated a long time ago.
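A rough sketch of the intended split (experiment and data paths are illustrative, not the actual ones in these scripts): lattice generation for chain training should use the training lang, while $lang_test should only be used when compiling the decoding graph.

```bash
# Lattices for chain training: use the training lang, not $lang_test.
steps/align_fmllr_lats.sh --nj 30 --cmd "$cmd" data/train data/lang exp/tri3 exp/tri3_lats

# $lang_test is only needed for the decoding graph.
utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test exp/chain/tdnn1 exp/chain/tdnn1/graph
```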
The rest of this PR: