Tesseract LSTM 4.0: letters repeat in recognized text #884

Closed
blueshade7 opened this issue May 5, 2017 · 6 comments

Comments

@blueshade7

When I run the tesseract command-line program (Windows prebuilt binary, 4.0.0 alpha) on this image in LSTM mode, I get:
LoOrenm 1pPpSsSUlI

Why do the letters repeat? Is the recognizer stuttering?
In legacy Tesseract mode (oem=0), I get the correct text: Lorem ipsum

sample
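
For anyone who wants to repeat the comparison, here is a minimal sketch that runs the same image through both engines. The file name `sample.png` is a placeholder, not the attachment's real name. The `--oem` flag is the spelling used by 4.x releases and may differ in an early alpha build. `--oem 0` also needs a traineddata file that still contains the legacy model.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image_path, oem):
    """Build a tesseract command line that prints the recognized text to stdout.

    oem 0 selects the legacy engine, oem 1 the LSTM engine.
    """
    return ["tesseract", str(image_path), "stdout", "--oem", str(oem)]


def run_ocr(image_path, oem):
    """Run tesseract and return the recognized text with surrounding whitespace removed."""
    result = subprocess.run(
        build_command(image_path, oem),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    image = Path("sample.png")  # placeholder file name, assumed for illustration
    # Only attempt the run when both the binary and the image are available.
    if shutil.which("tesseract") and image.exists():
        for oem in (0, 1):
            print(f"oem={oem}: {run_ocr(image, oem)}")
```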

@stweil
Contributor

stweil commented May 5, 2017

This might be one more example where the old 3.x recognizer produces better results than 4.x with LSTM. See here for the related discussion.

@blueshade7
Author

I dug into this and found that the letter bounding boxes are narrower than they should be.
I debugged a hand-built Tesseract 4 on another platform, so the output differs a bit from the above, but PageIterator::BoundingBox returns boxes narrower than the actual glyphs, as shown below. It seems the same glyph image is recognized several times because the horizontal scan advances in shorter strides than it should:

letter L BoundingBox=(2, 48, 375, 230)
letter o BoundingBox=(484, 76, 521, 236)
letter O BoundingBox=(521, 76, 559, 236)
letter r BoundingBox=(559, 76, 1043, 236)
letter e BoundingBox=(1119, 76, 1438, 236)
letter I BoundingBox=(1527, 76, 1564, 230)
letter n BoundingBox=(1564, 76, 1611, 230)
letter m BoundingBox=(1611, 76, 1658, 230)
letter n BoundingBox=(1658, 76, 1890, 230)
letter 1 BoundingBox=(2182, 1, 2436, 230)
letter p BoundingBox=(2607, 76, 2645, 295)
letter P BoundingBox=(2645, 76, 2682, 295)
letter p BoundingBox=(2682, 76, 2784, 295)
letter S BoundingBox=(2784, 76, 2826, 295)
letter s BoundingBox=(2999, 76, 3036, 237)
letter S BoundingBox=(3036, 76, 3129, 237)
letter U BoundingBox=(3186, 74, 3390, 236)
letter l BoundingBox=(3390, 74, 3548, 236)
letter I BoundingBox=(3548, 74, 3582, 236)
letter M BoundingBox=(3790, 76, 4172, 230)
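
One way to read the listing above is to group boxes that touch, i.e. where one box's left edge equals the previous box's right edge. The sketch below does that, assuming the tuples are (left, top, right, bottom). The helper names are made up for illustration, and a shared edge is only a hint that one glyph was split into several symbols, not proof.

```python
import re

# Listing copied from the comment above.
LISTING = """\
letter L BoundingBox=(2, 48, 375, 230)
letter o BoundingBox=(484, 76, 521, 236)
letter O BoundingBox=(521, 76, 559, 236)
letter r BoundingBox=(559, 76, 1043, 236)
letter e BoundingBox=(1119, 76, 1438, 236)
letter I BoundingBox=(1527, 76, 1564, 230)
letter n BoundingBox=(1564, 76, 1611, 230)
letter m BoundingBox=(1611, 76, 1658, 230)
letter n BoundingBox=(1658, 76, 1890, 230)
letter 1 BoundingBox=(2182, 1, 2436, 230)
letter p BoundingBox=(2607, 76, 2645, 295)
letter P BoundingBox=(2645, 76, 2682, 295)
letter p BoundingBox=(2682, 76, 2784, 295)
letter S BoundingBox=(2784, 76, 2826, 295)
letter s BoundingBox=(2999, 76, 3036, 237)
letter S BoundingBox=(3036, 76, 3129, 237)
letter U BoundingBox=(3186, 74, 3390, 236)
letter l BoundingBox=(3390, 74, 3548, 236)
letter I BoundingBox=(3548, 74, 3582, 236)
letter M BoundingBox=(3790, 76, 4172, 230)
"""

BOX_RE = re.compile(r"letter (\S) BoundingBox=\((\d+), (\d+), (\d+), (\d+)\)")


def parse_boxes(text):
    """Return a list of (char, left, top, right, bottom) tuples."""
    boxes = []
    for match in BOX_RE.finditer(text):
        char = match.group(1)
        left, top, right, bottom = (int(v) for v in match.groups()[1:])
        boxes.append((char, left, top, right, bottom))
    return boxes


def group_abutting(boxes):
    """Group consecutive boxes whose left edge equals the previous right edge."""
    groups = []
    for box in boxes:
        if groups and groups[-1][-1][3] == box[1]:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


groups = group_abutting(parse_boxes(LISTING))
for group in groups:
    letters = "".join(box[0] for box in group)
    widths = [box[3] - box[1] for box in group]
    print(f"{letters!r}: x={group[0][1]}..{group[-1][3]}, widths={widths}")
```

Most of the repeated letters (oO, pPp, sS) sit inside such touching groups, and the extra boxes in each group are only about 37 to 47 pixels wide.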

@kolomiyets

I have also noticed doubled characters in the output, but they mostly disappear (although not completely) once the model gets better (~ < 0.1%).

@blueshade7
Author

I drew a bounding box around each recognized letter in this image. Some boxes are spot on, but many are off, even though the text is correctly recognized as "Simple Test". Note that the boxes are intentionally drawn offset at the top and bottom to minimize the chance of overlaps.
simpletest_bbox

@Shreeshrii
Collaborator

This is probably because the LSTM engine is trained on whole text lines rather than on separate letters.

@theraysmith can clarify.
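
To illustrate the point, here is a toy sketch of how a line recognizer that emits one label per horizontal step can be collapsed into text, CTC-style: merge consecutive identical labels, then drop blanks. This is an assumption for illustration only, not Tesseract's actual decoder, and the frame sequences are invented. It shows how a glyph that flips between case variants from one step to the next would survive the merge as separate characters.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol for this toy example


def ctc_collapse(frames):
    """Merge runs of identical labels, then remove blank labels."""
    return "".join(label for label, _ in groupby(frames) if label != BLANK)


# Stable per-step labels collapse to one character per glyph.
print(ctc_collapse("LL-oo--rr"))

# If the labels for one glyph alternate between 'o' and 'O',
# both variants survive because they are different labels.
print(ctc_collapse("LL-ooOO-rr"))
```

In this model, character positions are derived from step indices rather than from segmented glyphs, which would be consistent with boxes that do not match the real letter extents.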

@stweil
Contributor

stweil commented Nov 9, 2020

I cannot reproduce the issue with the latest Tesseract or with release 4.1.1. Both produce "Lorem 17S tira", which is not correct but does not show duplicated characters.
