Incorrect recognition of specific words - additional letters inserted #1011

vidiecan · 2017-06-28T14:46:45Z

Tesseract (master branch; several versions tested on Windows and Ubuntu) in LSTM-only mode sometimes inserts letters that are not in the image. Which words are affected looks random, but each case is reproducible. We have seen it with eng, ces and nor (other languages not tested), and at different DPI settings.

Test case:

./tesseract ./to.recognise.1498480501215-2111908889.png stdout --oem 1 -l ces --psm 13

Expected: Veitvetveien
Got: Veitvetvelien
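
To pin down exactly which characters were inserted in a case like this, the expected and recognised strings can be diffed. This is a minimal sketch using Python's standard `difflib`; the helper name `inserted_chars` is made up for illustration and is not part of Tesseract.

```python
from difflib import SequenceMatcher


def inserted_chars(expected, got):
    """Return (index_in_got, text) for every span present only in `got`."""
    matcher = SequenceMatcher(None, expected, got, autojunk=False)
    return [
        (j1, got[j1:j2])
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag == "insert"
    ]


if __name__ == "__main__":
    # Strings taken from the test case above.
    print(inserted_chars("Veitvetveien", "Veitvetvelien"))
```

For the test case above, the only reported insertion is the extra "l" at index 9 of the recognised string.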

Letter bboxes visualised below

Shreeshrii · 2017-06-28T16:19:28Z

Similar Issue - #884

vidiecan · 2017-08-28T07:26:26Z

Using:

latin.traineddata

From the Tesseract code, it seems there is no handling for the case where the LSTM produces different results for a single letter across consecutive timesteps. For example, on a line containing the letter S:

...
128 null_char score=-11.4265, c=-0.0850153, perm=0, hash=fa6bdef8 prev:null_char score=-11.3415, c=-0.0935133, perm=0, hash=fa6bdef8
129 null_char score=-11.5115, c=-0.0850045, perm=0, hash=fa6bdef8 prev:null_char score=-11.4265, c=-0.0850153, perm=0, hash=fa6bdef8
130 null_char score=-11.5966, c=-0.0850804, perm=0, hash=fa6bdef8 prev:null_char score=-11.5115, c=-0.0850045, perm=0, hash=fa6bdef8
131 null_char score=-11.6853, c=-0.088736, perm=0, hash=fa6bdef8 prev:null_char score=-11.5966, c=-0.0850804, perm=0, hash=fa6bdef8
132 label=3, uid=5=S [53 ]A score=-12.0022, c=-0.316888, Start End perm=8, hash=6b41093b prev:null_char score=-11.6853, c=-0.088736, perm=0, hash=fa6bdef8
133 label=3, uid=5=S [53 ]A score=-13.1056, c=-1.10339, perm=8, hash=6b41093b prev:label=3, uid=5=S [53 ]A score=-12.0022, c=-0.316888, Start End perm=8, hash=6b41093b
134 label=64, uid=66=$ [24 ] score=-13.2539, c=-0.14831, End perm=8, hash=86b8e412 prev:label=3, uid=5=S [53 ]A score=-13.1056, c=-1.10339, perm=8, hash=6b41093b
135 label=64, uid=66=$ [24 ] score=-13.8318, c=-0.577888, perm=8, hash=86b8e412 prev:label=64, uid=66=$ [24 ] score=-13.2539, c=-0.14831, End perm=8, hash=86b8e412
136 null_char score=-13.9704, c=-0.138632, perm=0, hash=86b8e412 prev:label=64, uid=66=$ [24 ] score=-13.8318, c=-0.577888, perm=8, hash=86b8e412
137 null_char score=-14.0555, c=-0.0850982, perm=0, hash=86b8e412 prev:null_char score=-13.9704, c=-0.138632, perm=0, hash=86b8e412
138 null_char score=-14.1405, c=-0.0850008, perm=0, hash=86b8e412 prev:null_char score=-14.0555, c=-0.0850982, perm=0, hash=86b8e412
139 null_char score=-14.2255, c=-0.0850006, perm=0, hash=86b8e412 prev:null_char score=-14.1405, c=-0.0850008, perm=0, hash=86b8e412
140 null_char score=-14.3105, c=-0.085, perm=0, hash=86b8e412 prev:null_char score=-14.2255, c=-0.0850006, perm=0, hash=86b8e412

and

Best choice: accepted=0, adaptable=0, done=1 : Lang result : S$ : R=3.06256, C=-7.72375, F=1, Perm=8, xht=[0,3.40282e+38], ambig=0
pos     NORM    NORM
str     S       $
state:  1       1
C       -1.103  -0.578
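
As a rough illustration of why both characters end up in the result, here is a simplified CTC-style collapse applied to the best label per timestep, transcribed from timesteps 128-140 of the trace above. It is only a sketch: the real beam search in Tesseract also tracks recoded labels and the Start/End flags shown in the trace, which are ignored here. The names `NULL`, `trace` and `collapse` are made up for the example.

```python
from itertools import groupby

NULL = None  # stands in for null_char in the trace

# Best label per timestep, 128..140: four nulls, S, S, $, $, five nulls.
trace = [NULL] * 4 + ["S", "S", "$", "$"] + [NULL] * 5


def collapse(labels):
    """Merge runs of identical labels, then drop the null label."""
    return "".join(label for label, _ in groupby(labels) if label is not NULL)


if __name__ == "__main__":
    print(collapse(trace))
```

Because no null separates the S timesteps from the $ timesteps, nothing in this scheme says that they cover the same glyph, so both are emitted.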

@theraysmith
What is the expected behaviour in this case? Should multiple competing choices that clearly belong to a single character be handled?
Thank you.

vidiecan · 2018-04-20T20:39:49Z

For the record, you can try to fine-tune or fix the clear problems, for example around this line:

tesseract/lstm/recodebeam.cpp

Line 266 in 8f7be2e

// Backtrack extracting only valid, non-duplicate unichar-ids.

by counting how many times each character was recognised and disregarding clear outliers. For example, if a letter was seen as M 8 times and as N only once (and, for specific reasons, it would still get the lowest certainty), "transform" N to M.

However, better training would fix it too.
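
The counting idea can be sketched outside of Tesseract as follows. This is not the code in recodebeam.cpp; it is a standalone Python illustration that assumes every run of non-null timesteps between two nulls belongs to a single glyph, which does not hold when adjacent letters touch. The function name `vote` and the threshold `min_ratio` are made up for the example.

```python
from collections import Counter
from itertools import groupby

NULL = None  # stands in for null_char


def vote(labels, min_ratio=4):
    """Collapse per-timestep labels, dropping clear outliers within a run."""
    out = []
    # Split the sequence into alternating null and non-null runs.
    for is_null, run in groupby(labels, key=lambda label: label is NULL):
        if is_null:
            continue
        run = list(run)
        ranked = Counter(run).most_common(2)
        top, top_count = ranked[0]
        if len(ranked) == 1 or top_count >= min_ratio * ranked[1][1]:
            # One label clearly dominates: treat the others as outliers.
            out.append(top)
        else:
            # No clear winner: fall back to a plain CTC-style collapse.
            out.extend(label for label, _ in groupby(run))
    return "".join(out)


if __name__ == "__main__":
    print(vote([NULL] + ["M"] * 4 + ["N"] + ["M"] * 4 + [NULL]))
    print(vote([NULL, "S", "S", "$", "$", NULL]))
```

With M seen 8 times and N once, only M is kept. The S/$ case from the earlier trace is an even split, so a count alone would not resolve it and both characters are still emitted.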

vidiecan · 2018-04-20T20:39:58Z

I think you can close this one

jghare · 2018-04-21T07:05:34Z

Thanks for taking this up as a serious issue. As a pure user (not a code contributor), I just need a way to switch the extra characters on and off.
When switched off, it should output only the one character it "feels" is most probable and discard the rest of the choices.
Other post-processing is possible on the user side. For example, the user often knows whether a certain character fits its position, such as a 5 in place of an S or vice versa, and can implement those fixes in sed or awk scripts to clean the data. At least that's what I am doing. :)
But the extra characters make things very difficult. The post-processing required would be cut by at least 50% if no extra characters were added.
I am sure the extra characters would be a welcome feature for someone else, but not for my current application, hence the request for an on/off switch.
Thanks again!
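
The kind of position-based clean-up described above might look like the following sketch, written in Python rather than sed or awk. The layout string (`L` for a position where a letter is expected, `D` for a digit) and the function name `fix_by_position` are assumptions made for the example; only the 5/S confusion comes from the comment itself, and the tables can be extended.

```python
TO_LETTER = {"5": "S"}  # digit misread where a letter is expected
TO_DIGIT = {"S": "5"}   # letter misread where a digit is expected


def fix_by_position(text, layout):
    """Fix known confusions using a per-position layout ('L' or 'D')."""
    if len(text) != len(layout):
        # An inserted character shifts every later position, so the
        # layout can no longer be trusted; return the text unchanged.
        return text
    out = []
    for ch, kind in zip(text, layout):
        if kind == "L":
            ch = TO_LETTER.get(ch, ch)
        elif kind == "D":
            ch = TO_DIGIT.get(ch, ch)
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(fix_by_position("5A5S", "LLDD"))
    print(fix_by_position("5A5S5", "LLDD"))
```

The second call shows why inserted characters hurt this approach: once the length no longer matches the known layout, a position-based fix cannot be applied safely.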

Shreeshrii · 2018-04-21T11:54:38Z

@vidiecan Do you have a PR for your suggested changes?

@stweil Can this be addressed for 4.0.0?

stweil · 2020-11-09T07:40:59Z

Latest Tesseract and also release 4.1.1 recognize "Veitvetveien", so this issue seems to be fixed.

amitdo mentioned this issue Aug 1, 2017

German - Characters added to result multiple times (aä / AÄ) #1060

Open

vidiecan mentioned this issue Apr 20, 2018

Tesseract inserting additional alternative characters #1465

Open

stweil added the accuracy label Nov 7, 2020

stweil mentioned this issue Nov 7, 2020

Character confusion fix suggestion #3144

Open

amitdo added the diplopia label Mar 17, 2021

amitdo closed this as completed Jun 30, 2021
