Some fixes in IAM scripts re using the right lang for training #2340
My last PR (#2315) introduced some issues which I have fixed in this PR.
The main issue was that I created only one dict (which includes the 50k most frequent words). This dict is meant for decoding; if we use it for training we get some WER degradation, because around 1% of the training text ends up OOV. The results in that PR are still correct because I used previously trained models (from before the change in the dict), but if we run from scratch we can't replicate them. So in this PR I create separate langs for training and decoding: the training lang includes all the training words, while the decoding lang includes only the 50k most frequent ones.
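For illustration, a minimal sketch of building the two langs from separate dicts (the dict/lang directory names and the OOV symbol here are assumptions, not necessarily the ones used in the recipe):

```bash
# Training lang: dict covers all words in the training text, so there are no OOVs during training.
utils/prepare_lang.sh data/local/dict_train "<unk>" data/local/lang_train_tmp data/lang

# Decoding lang: dict restricted to the 50k most frequent words; used only for the decoding graph.
utils/prepare_lang.sh data/local/dict_decode "<unk>" data/local/lang_decode_tmp data/lang_test
```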
Another subtle issue is that the variable $lang_test in the chain scripts was used in the wrong places (for example, for lattice generation). This leads to big lattices and WER degradation if lang_unk is used. I never noticed this because I always used the same lattices I had generated a long time ago.
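A rough sketch of the intended split (experiment and data paths are illustrative, not the actual ones in these scripts): lattice generation for chain training should use the training lang, while $lang_test should only be used when compiling the decoding graph.

```bash
# Lattices for chain training: use the training lang, not $lang_test.
steps/align_fmllr_lats.sh --nj 30 --cmd "$cmd" data/train data/lang exp/tri3 exp/tri3_lats

# $lang_test is only needed for the decoding graph.
utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test exp/chain/tdnn1 exp/chain/tdnn1/graph
```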
The rest of this PR: