Incorrect recognition of specific words - additional letters inserted #1011

Closed
vidiecan opened this issue Jun 28, 2017 · 7 comments

@vidiecan

Tesseract (master branch; several versions tested on Windows and Ubuntu) in LSTM-only mode sometimes inserts letters where there should be none. The affected words look random, but the errors are reproducible. We have observed this with eng, ces and nor (other languages not tested) and at different DPI settings.

Test case:
[Image: to recognise 1498480501215-2111908889]

./tesseract ./to.recognise.1498480501215-2111908889.png stdout --oem 1 -l ces --psm 13

Expected: Veitvetveien
Got: Veitvetvelien

Letter bboxes visualised below
[Image: tesseract bug report]

@Shreeshrii
Collaborator

Similar Issue - #884

@vidiecan
Author

Using:

  • latin.traineddata

From the Tesseract code, it seems there is no handling of the case where the LSTM produces different results for a single letter over consecutive timesteps. For example, for a line containing the letter S, the debug output is:

...
128 null_char score=-11.4265, c=-0.0850153, perm=0, hash=fa6bdef8 prev:null_char score=-11.3415, c=-0.0935133, perm=0, hash=fa6bdef8
129 null_char score=-11.5115, c=-0.0850045, perm=0, hash=fa6bdef8 prev:null_char score=-11.4265, c=-0.0850153, perm=0, hash=fa6bdef8
130 null_char score=-11.5966, c=-0.0850804, perm=0, hash=fa6bdef8 prev:null_char score=-11.5115, c=-0.0850045, perm=0, hash=fa6bdef8
131 null_char score=-11.6853, c=-0.088736, perm=0, hash=fa6bdef8 prev:null_char score=-11.5966, c=-0.0850804, perm=0, hash=fa6bdef8
132 label=3, uid=5=S [53 ]A score=-12.0022, c=-0.316888, Start End perm=8, hash=6b41093b prev:null_char score=-11.6853, c=-0.088736, perm=0, hash=fa6bdef8
133 label=3, uid=5=S [53 ]A score=-13.1056, c=-1.10339, perm=8, hash=6b41093b prev:label=3, uid=5=S [53 ]A score=-12.0022, c=-0.316888, Start End perm=8, hash=6b41093b
134 label=64, uid=66=$ [24 ] score=-13.2539, c=-0.14831, End perm=8, hash=86b8e412 prev:label=3, uid=5=S [53 ]A score=-13.1056, c=-1.10339, perm=8, hash=6b41093b
135 label=64, uid=66=$ [24 ] score=-13.8318, c=-0.577888, perm=8, hash=86b8e412 prev:label=64, uid=66=$ [24 ] score=-13.2539, c=-0.14831, End perm=8, hash=86b8e412
136 null_char score=-13.9704, c=-0.138632, perm=0, hash=86b8e412 prev:label=64, uid=66=$ [24 ] score=-13.8318, c=-0.577888, perm=8, hash=86b8e412
137 null_char score=-14.0555, c=-0.0850982, perm=0, hash=86b8e412 prev:null_char score=-13.9704, c=-0.138632, perm=0, hash=86b8e412
138 null_char score=-14.1405, c=-0.0850008, perm=0, hash=86b8e412 prev:null_char score=-14.0555, c=-0.0850982, perm=0, hash=86b8e412
139 null_char score=-14.2255, c=-0.0850006, perm=0, hash=86b8e412 prev:null_char score=-14.1405, c=-0.0850008, perm=0, hash=86b8e412
140 null_char score=-14.3105, c=-0.085, perm=0, hash=86b8e412 prev:null_char score=-14.2255, c=-0.0850006, perm=0, hash=86b8e412

and

Best choice: accepted=0, adaptable=0, done=1 : Lang result : S$ : R=3.06256, C=-7.72375, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM
str     S       $
state:  1       1
C       -1.103  -0.578
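To make the trace above easier to follow, here is a minimal sketch of plain greedy CTC collapsing (merge repeated labels, then drop nulls). It is not Tesseract's actual beam search; the function name `ctc_collapse` and the `NULL` marker are hypothetical and only illustrate why two adjacent labels inside one glyph come out as two characters.

```python
from itertools import groupby

NULL = None  # hypothetical stand-in for null_char in the trace


def ctc_collapse(labels):
    """Merge runs of identical labels, then drop nulls (greedy CTC decoding)."""
    return "".join(label for label, _ in groupby(labels) if label is not NULL)


# Timesteps 128-140 from the trace: four nulls, S, S, $, $, five nulls.
timesteps = [NULL] * 4 + ["S", "S", "$", "$"] + [NULL] * 5
print(ctc_collapse(timesteps))
```

Because the S and $ timesteps are adjacent with no null between them, nothing in this scheme marks them as competing readings of the same glyph, so both are emitted.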

@theraysmith
What is the expected behaviour in this case? Should multiple choices that clearly fall within one character be handled?
Thank you.

@vidiecan
Author

For the record, you can try to fine-tune or fix clear problems, for example around the line

// Backtrack extracting only valid, non-duplicate unichar-ids.

by computing the "count" of times each character was recognised and disregarding clear outliers. For example, if a letter was seen as M 8 times and as N only once (and, for specific reasons, N would still get the lowest certainty), "transform" N to M.

However, better training would fix it too.
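The counting idea above could be sketched roughly as follows. This is an illustration on a plain list of per-timestep labels, not a patch against Tesseract's code; `vote_within_runs`, `NULL` and the `min_ratio` threshold are all hypothetical. It assumes that genuinely separate characters are divided by at least one null timestep, so any run of non-null timesteps belongs to one glyph.

```python
from collections import Counter
from itertools import groupby

NULL = None  # hypothetical stand-in for null_char


def vote_within_runs(labels, min_ratio=4.0):
    """Relabel rare outliers inside each run of non-null timesteps.

    A label counts as an outlier when the dominant label of the same run
    was seen at least `min_ratio` times as often.
    """
    out = []
    for is_null, run in groupby(labels, key=lambda lab: lab is NULL):
        run = list(run)
        if is_null:
            out.extend(run)
            continue
        counts = Counter(run)
        top, top_n = counts.most_common(1)[0]
        out.extend(top if top_n >= min_ratio * counts[lab] else lab for lab in run)
    return out


# M seen 8 times, N once: N is folded into M.
print(vote_within_runs([NULL] + ["M"] * 8 + ["N"] + [NULL]))
```

With this threshold a balanced run such as S, S, $, $ is left untouched, so the vote only removes clear outliers and does not decide between equally frequent readings.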

@vidiecan
Author

I think you can close this one

@jghare

jghare commented Apr 21, 2018

Thanks for taking this up as a serious issue. Being a pure user (not a code contributor), I just need a way to switch the extra characters on and off.
When switched off, it should output only the one character it "feels" is most probable and discard the rest of the choices.
Other post-processing is possible on the user side. For example, the user often knows whether a certain character is in the right position or not, like 5 in place of S or vice versa, and can implement those changes in sed or awk scripts to clean their data... At least that's what I am doing.. :)
But the extra characters make things very difficult. The post-processing required would be cut down by at least 50% if no extra characters were added...
I am also sure that adding extra characters is a sweet feature for someone else, just not for my current application, hence the request for an on/off switch.
Thanks again!
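The kind of position-based cleanup described above might look like this in Python instead of sed or awk. It is only a sketch: the function name `fix_confusions` and the two rules (a 5 between letters is read as S, an S between digits is read as 5) are assumptions made for illustration and would need adapting to real data.

```python
import re


def fix_confusions(text):
    """Swap 5/S where the neighbouring characters make the intent clear."""
    # Assumption: a "5" with a letter on both sides is a misread "S".
    text = re.sub(r"(?<=[A-Za-z])5(?=[A-Za-z])", "S", text)
    # Assumption: an "S" with a digit on both sides is a misread "5".
    text = re.sub(r"(?<=\d)S(?=\d)", "5", text)
    return text


print(fix_confusions("HOU5E 1S3"))
```

Substitutions like these work when the output has the right number of characters; an inserted extra character shifts positions, which is what makes it harder to clean up.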

@Shreeshrii
Collaborator

@vidiecan Do you have a PR for your suggested changes?

@stweil Can this be addressed for 4.0.0?

@stweil
Contributor

stweil commented Nov 9, 2020

Latest Tesseract and also release 4.1.1 recognize "Veitvetveien", so this issue seems to be fixed.
