Incorrect recognition of specific words - additional letters inserted #1011
Comments
Similar Issue - #884 |
Using:
From the Tesseract code, it seems that there is no handling for the case where the LSTM produces different results for a single letter over time, e.g., there is a line with the letter S
and
@theraysmith |
For the record, you can try to fine-tune or fix clear problems, for example around Line 266 in 8f7be2e
by counting how many times each character was recognised and disregarding clear outliers, e.g., if a letter was seen as M 8 times and as N only once (and, for specific reasons, it would still get the lowest certainty), "transform" N to M. However, better training would fix it too. |
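The counting idea above can be sketched outside Tesseract as follows. This is only an illustration, not Tesseract's code: the function name `smooth_outliers`, the `min_ratio` threshold, and the assumption that the per-timestep labels for one character position are already available as a list are all hypothetical.

```python
from collections import Counter


def smooth_outliers(labels, min_ratio=4.0):
    """Replace rare labels in one character's run with the dominant label.

    labels: per-timestep labels for a single character position (hypothetical input).
    min_ratio: a label is treated as an outlier when the dominant label
               occurs at least min_ratio times as often (assumed threshold).
    """
    if not labels:
        return []
    counts = Counter(labels)
    dominant, dominant_count = counts.most_common(1)[0]
    # Keep a label unless it is clearly outnumbered by the dominant one.
    return [
        dominant if counts[lab] * min_ratio <= dominant_count else lab
        for lab in labels
    ]


# The example from the comment: M seen 8 times, N seen once.
run = list("MMMMNMMMM")
print("".join(smooth_outliers(run)))
```

With the default threshold, the single N is rewritten to M, while a run split evenly between two labels is left unchanged, so genuinely ambiguous cases are not silently overridden.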
I think you can close this one |
Thanks for taking this up as a serious issue. As a pure user (not a code contributor), I just need a way to switch extra characters on and off. |
Latest Tesseract and also release 4.1.1 recognize "Veitvetveien", so this issue seems to be fixed. |
Tesseract (master branch, several versions tested on Windows/Ubuntu) using LSTM only sometimes inserts letters where they should not be. We have observed this happening randomly (but reproducibly) with eng, ces, and nor (others not tested), and at different DPI settings.
Test case:
Expected: Veitvetveien
Got: Veitvetvelien
Letter bboxes visualised below