lstmeval on trained model appears to be making Unicode substitution #270

Closed
johnbeard opened this issue Jul 27, 2021 · 5 comments
Labels
question Further information is requested

Comments

@johnbeard

I am trying to create a model for old-style English printing that uses the long-s (ſ) character. Generally, the long-s is used for any non-final lowercase 's' in a word.

I have successfully trained a model that appears to have reasonable accuracy (at least better in some ways than eng, which usually mistakes the long-s for 'f'). I am now trying to evaluate the accuracy of the model so I can make adjustments in the right direction.

I have generated a set of ground-truth images (taken from real scans, though the model itself was trained on generated text).
However, the lstmeval output shows the long-s replaced by a plain 's'.

For example:

lstmeval --model data/eng_oldcaslon_longs.traineddata --eval_listfile data/eval_eng_old/all-lstmf --verbosity 2

....

Truth:incapable of diſcharging the ſocial duties of life, or enjoying the felicities of it.
OCR  :incapable of discharging the social duties of life, or enjoying the felicities of it.
Truth:I mean not to exhibit horror for the purpoſe of provoking revenge, but to
OCR  :I mean not to exhibit horror for the purpose of provoking revenge, but tol :,
Line Char error rate=0.068493, Word error rate=0.071429

The model appears to be recognising the long-s correctly, because 1) there's no error in the first line and 2) if it weren't, a ground-truth long-s would be read as 'f' (or maybe 'l'), not 's'.

However, in the OCR: lines, it's being printed as an 's'. This is making it a little awkward for me to compare failure modes while tweaking the model.
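To make that comparison easier, here is a minimal Python sketch that pairs up the Truth:/OCR: lines and lists only the character-level differences, optionally folding ſ to s on both sides so the substitution doesn't drown out the real errors. The line format is assumed from the sample above, and the helper names (`parse_pairs`, `char_diffs`, `nfkc_changes`) are made up for illustration. It also shows that Unicode NFKC normalization maps ſ (U+017F) to s; whether lstmeval applies such a normalization is only a guess and is not confirmed anywhere in this thread.

```python
import difflib
import unicodedata

LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S

# Two Truth/OCR pairs copied from the lstmeval output quoted above.
SAMPLE = """\
Truth:incapable of diſcharging the ſocial duties of life, or enjoying the felicities of it.
OCR  :incapable of discharging the social duties of life, or enjoying the felicities of it.
Truth:I mean not to exhibit horror for the purpoſe of provoking revenge, but to
OCR  :I mean not to exhibit horror for the purpose of provoking revenge, but tol :,
"""


def parse_pairs(log_text):
    """Collect (truth, ocr) pairs from 'Truth:' / 'OCR  :' lines (assumed format)."""
    pairs, truth = [], None
    for line in log_text.splitlines():
        if line.startswith("Truth:"):
            truth = line[len("Truth:"):]
        elif line.startswith("OCR") and truth is not None:
            # Split on the first colon only; the OCR text may contain colons itself.
            pairs.append((truth, line.split(":", 1)[1]))
            truth = None
    return pairs


def char_diffs(truth, ocr, fold_long_s=False):
    """Return the non-equal edit operations between truth and ocr."""
    if fold_long_s:
        truth = truth.replace(LONG_S, "s")
        ocr = ocr.replace(LONG_S, "s")
    matcher = difflib.SequenceMatcher(None, truth, ocr, autojunk=False)
    return [
        (tag, truth[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def nfkc_changes(text):
    """List the characters in text that NFKC normalization would alter."""
    return [c for c in text if unicodedata.normalize("NFKC", c) != c]


if __name__ == "__main__":
    for truth, ocr in parse_pairs(SAMPLE):
        print("raw   :", char_diffs(truth, ocr))
        print("folded:", char_diffs(truth, ocr, fold_long_s=True))
        print("NFKC would change:", nfkc_changes(truth))
```

With folding switched on, only the differences that are not ſ/s remain, which is the view I would want while tweaking the model.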

@stweil
Member

stweil commented Jul 27, 2021

I can confirm this behaviour.

Hint: models already exist that might work for you, for example the standard model script/Fraktur or our frak2021 models.

@johnbeard
Author

johnbeard commented Jul 27, 2021

@stweil thanks!

I'm surprised the Fraktur models work so well (frak2021: CER=4.05, WER=11.4), since this isn't Fraktur. Here is an image from the evaluation corpus:

[Image: common-sense-p20-039]

Image generated from the tessedit_write_images=1 output.

What is frak2021 trained on, out of interest? It's very impressive.

I can't compare against eng without more work: the ground truth won't encode, since ſ isn't in that model at all. For the others I get (CER/WER) 9.5/25 with ita_old, 10/25 with frk, 6.2/18 with GT4HistOCR and 8.6/24.5 with script/Fraktur. My current best new model is at 6.3/17.3.
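For anyone who wants to reproduce this kind of comparison outside lstmeval, here is a rough Python sketch of character and word error rates based on plain Levenshtein distance. The function names are made up for illustration, and this is the textbook definition (edit distance divided by ground-truth length), not necessarily how lstmeval counts errors, so the numbers need not match what Tesseract prints.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # deletion from a
                cur[j - 1] + 1,          # insertion into a
                prev[j - 1] + (x != y),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def error_rate(truth_items, ocr_items):
    """Edit distance divided by the length of the ground truth."""
    return levenshtein(truth_items, ocr_items) / max(len(truth_items), 1)


def cer(truth, ocr):
    """Character error rate as a fraction (multiply by 100 for a percentage)."""
    return error_rate(truth, ocr)


def wer(truth, ocr):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(truth.split(), ocr.split())


if __name__ == "__main__":
    truth = "the purpoſe of provoking revenge"
    ocr = "the purpose of provoking revenge"
    print(f"CER={cer(truth, ocr):.4f} WER={wer(truth, ocr):.4f}")
```

Because ſ and s are different code points, a line that differs only in that character still counts as one character error and one word error under this definition.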

johnbeard changed the title from "lstmeval on trained model appear to be making Unicode substitution" to "lstmeval on trained model appears to be making Unicode substitution" on Jul 27, 2021
@stweil
Member

stweil commented Jul 27, 2021

> What is frak2021 trained on, out of interest?

See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR#frak2021.

@johnbeard
Author

Also, by the way, do you have a tool for editing the ground-truth text lines? I find doing them with a text editor extremely annoying.

wrznr added the "question" label on Aug 27, 2021
@wrznr
Collaborator

wrznr commented Aug 27, 2021

Have a look at https://github.com/OCR4all/LAREX. Also OCRpy has a simplistic browser-based transcription utility.
