also fill PAGE's glyphs and its variants and confidences via GetIterator() in recognize.py #7
Comments
Thanks for your feedback, much appreciated. Short update on progress: last week we laid the groundwork to make this possible by making the PAGE library in OCR-D/core more flexible. Now we're discussing how to parameterize segmentation/recognition so the Tesseract API can be used on different levels with different langdata etc. Once that is settled, recognizing on glyph level and keeping confidence should be straightforward to implement.
Implemented a prototype. This does add glyph annotation with confidences, but alternative hypotheses are only available with Tesseract 3 models (while LSTM models just give the first reading). Not sure if that change can help.
This and related tesseract-ocr/tesseract#1851 and tesseract-ocr/tesseract#1997 do not in fact service the old API (
@noahmetzger @bertsky What's the status here?
Still the same as 3 weeks ago AFAICT. What is worse, we cannot go further as long as tesserocr does not migrate to the 4.0 codebase. (It does not currently build at all.) See here for Noah's PR, which also needs to be updated.
I will update the PR this or next week, but we are currently working on another part of the project.
@noahmetzger @bertsky What's the status here?
@noahmetzger can be reached again after Easter Monday. He is currently focused on implementing the requirements for @bertsky in Tesseract.
Noah did make changes that populate the old iterator API on the LSTM engine, which have been merged already. But as I argued elsewhere, this cannot produce correct scores (for everything besides the best path) and may produce illegal characters (because it does not respect the incremental character encoding of the beam). Also, when trying lattice output, character whitelisting and user patterns/words, I observed that the current beam search is too narrow anyway. So we are currently working on two possible solutions:
For OCR postcorrection, `TextLine.Word.Glyph.TextEquiv` can be more valuable than just `TextLine.TextEquiv`. It allows building up a lattice (or rather, a confusion network) of alternative character hypotheses to (re)build words and phrases from. The PAGE notion of character hypotheses is glyph variants, i.e. a sequence of `TextEquiv` elements with `index` and `conf` (confidence) attributes. This does not help in addressing segmentation ambiguity (especially on the word level, since PAGE enforces a hierarchy of `Word` elements), but most ambiguity on the character level can still be captured. Example:
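As an illustrative sketch (pure Python with made-up data, not the PAGE or OCR-D API), per-glyph variants with confidences form exactly such a confusion network, from which ranked word hypotheses can be rebuilt:

```python
import math
from itertools import product

# Hypothetical glyph variants for a 3-glyph word, as (text, confidence)
# pairs, mirroring PAGE's Glyph/TextEquiv@index/@conf annotation.
glyphs = [
    [("c", 0.90), ("e", 0.08)],
    [("a", 0.95)],
    [("t", 0.70), ("l", 0.25)],
]

# Enumerate all paths through the confusion network and rank them by the
# product of glyph confidences (assuming independence between glyphs).
hypotheses = sorted(
    (("".join(text for text, _ in path),
      math.prod(conf for _, conf in path))
     for path in product(*glyphs)),
    key=lambda h: h[1], reverse=True)

for word, conf in hypotheses:
    print(f"{word}: {conf:.3f}")  # best reading first, e.g. "cat: 0.599"
```

A postcorrection step could then rescore these hypotheses against a dictionary or language model instead of trusting only the single best `TextLine.TextEquiv`.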
So this part of the wrapper should also dive into the word and character/glyph substructure as a complementary level of annotation. Tesseract's API seems to be straightforward for this use case: `baseapi.h` contains `GetIterator()`, giving a `ResultIterator`, which allows recursing across `RIL_SYMBOL` as `PageIteratorLevel`. For each glyph, `GetUTF8Text()` and `Confidence()` then yield what we need.