also fill PAGE's glyphs and their variants and confidences via GetIterator() in recognize.py #7

Closed · bertsky opened this issue Apr 20, 2018 · 9 comments · Fixed by #110

@bertsky (Collaborator) commented Apr 20, 2018

For OCR post-correction, TextLine.Word.Glyph.TextEquiv can be more valuable than just TextLine.TextEquiv. It allows building a lattice (or rather, a confusion network) of alternative character hypotheses from which to (re)build words and phrases. The PAGE notion of character hypotheses is glyph variants, i.e. a sequence of TextEquiv elements with index and conf (confidence) attributes. This does not help with segmentation ambiguity (especially on the word level, since PAGE enforces a hierarchy of Word elements), but most ambiguity on the character level can still be captured.

Example:

```xml
<TextLine id="...">
  <Coords points="..."/>
  <Word id="...">
    <Coords points="..."/>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv>
        <Unicode>a</Unicode>
      </TextEquiv>
    </Glyph>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv index="0" conf="0.6">
        <Unicode>m</Unicode>
      </TextEquiv>
      <TextEquiv index="1" conf="0.3">
        <Unicode>rn</Unicode>
      </TextEquiv>
      <TextEquiv index="2" conf="0.1">
        <Unicode>in</Unicode>
      </TextEquiv>
    </Glyph>
  </Word>
  <Word id="...">
    ...
  </Word>
  <TextEquiv>
    <Unicode>am Ende</Unicode>
  </TextEquiv>
</TextLine>
```

So this part of the wrapper should also dive into the word and character/glyph substructure as a complementary level of annotation. Tesseract's API seems straightforward for this use case: baseapi.h contains GetIterator(), which gives a ResultIterator that can recurse down to RIL_SYMBOL as its PageIteratorLevel. For each glyph, GetUTF8Text() and Confidence() then yield what we need.
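A minimal sketch of that traversal through the tesserocr bindings might look like this (the image path is a placeholder; GetChoiceIterator wraps LTRResultIterator::GetChoiceIterator and would expose the glyph variants):

```python
from tesserocr import PyTessBaseAPI, RIL, iterate_level

with PyTessBaseAPI() as api:
    api.SetImageFile('page.png')  # placeholder input image
    api.Recognize()
    iterator = api.GetIterator()
    for glyph in iterate_level(iterator, RIL.SYMBOL):
        # best reading and its confidence (percent) for this glyph
        text = glyph.GetUTF8Text(RIL.SYMBOL)
        conf = glyph.Confidence(RIL.SYMBOL)
        # alternative readings, i.e. glyph variants in PAGE terms
        for choice in glyph.GetChoiceIterator():
            alt_text = choice.GetUTF8Text()
            alt_conf = choice.Confidence()
```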

@kba (Member) commented Apr 26, 2018

Thanks for your feedback, much appreciated. A short update on progress: we laid the groundwork last week to make this possible by making the PAGE library in OCR-D/core more flexible. Now we are discussing how to parameterize segmentation/recognition so that the Tesseract API can be used on different levels with different langdata etc. Once that is settled, recognizing on the glyph level and keeping confidences should be straightforward to implement.

@bertsky (Collaborator, Author) commented Aug 5, 2018

Implemented a prototype. This does add glyph annotation with confidences, but alternative hypotheses are only available with Tesseract 3 models (while LSTM models just give the first reading).
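For reference, annotating such results through the OCR-D PAGE API might look roughly like this (a sketch against the generated ocrd_models.ocrd_page classes; the id, coordinates and the existing `word` object are placeholders):

```python
from ocrd_models.ocrd_page import GlyphType, TextEquivType, CoordsType

# attach a glyph with two alternative readings to an existing Word object
glyph = GlyphType(id='word1_glyph2',  # placeholder id
                  Coords=CoordsType(points='0,0 10,0 10,20 0,20'))
glyph.add_TextEquiv(TextEquivType(Unicode='m', index=0, conf=0.6))
glyph.add_TextEquiv(TextEquivType(Unicode='rn', index=1, conf=0.3))
word.add_Glyph(glyph)  # `word` is an existing WordType instance
```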

Not sure if that change can help.

@bertsky (Collaborator, Author) commented Oct 25, 2018

> Not sure if that change can help.

This and the related tesseract-ocr/tesseract#1851 and tesseract-ocr/tesseract#1997 do not in fact serve the old API (LTRResultIterator::GetChoiceIterator), but merely introduce a new function (TessBaseAPI::GetBestLSTMChoices), which will become available with release 4.0. So either we adapt to that, or we wait for the old API to be fixed as well.

@kba (Member) commented Nov 13, 2018

@noahmetzger @bertsky What's the status here?

@bertsky (Collaborator, Author) commented Nov 13, 2018

Still the same as 3 weeks ago AFAICT. GetBestLSTMChoices is a good start (especially for independent experiments), but I still hesitate to adapt to it here: I would like to keep both old (pre-LSTM) and new models running and producing consistent results. If we could at least query the API about old vs. new, then we could attempt a different backend for each, but I do not see any way to do so. (There is get_languages, giving the names of the models, and the OEM class, corresponding to the TessOcrEngineMode enum, but nothing that says "this language is that mode".)
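To illustrate what the API does (and does not) tell us, a small sketch with tesserocr:

```python
import tesserocr
from tesserocr import PyTessBaseAPI, OEM

# lists the tessdata path and the installed model names --
# but nothing reveals whether a given model is legacy or LSTM
print(tesserocr.get_languages())

# the engine mode can only be chosen globally, per API instance:
with PyTessBaseAPI(lang='eng', oem=OEM.TESSERACT_ONLY) as api:  # legacy engine
    pass
with PyTessBaseAPI(lang='eng', oem=OEM.LSTM_ONLY) as api:  # LSTM engine
    pass
```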

What is worse, we cannot go further as long as tesserocr does not migrate to the 4.0 codebase. (It does not currently build at all.) See here for Noah's PR, which also needs to be updated.

@noahmetzger (Contributor) commented

I will update the PR this week or next, but we are currently working on another part of the project.

@wrznr (Contributor) commented Apr 17, 2019

@noahmetzger @bertsky What's the status here?

@stweil (Contributor) commented Apr 17, 2019

@noahmetzger can be reached again after Easter Monday. His main focus is currently implementing the requirements for @bertsky in Tesseract.

@bertsky (Collaborator, Author) commented Apr 17, 2019

Noah did make changes that populate the old iterator API on the LSTM engine, and they have been merged already. But as I argued elsewhere, this cannot produce correct scores (for anything besides the best path) and may produce illegal characters (because it does not respect the incremental character encoding of the beam).

Also, when trying lattice output, character whitelisting and user patterns/words I observed that the current beam search is too narrow anyway.

So we are currently working on two possible solutions:

  • Noah will try to find a way to correctly re-integrate partial hypotheses (which fell off the beam), and to widen the overly restrictive beam entrance (which currently only admits the top 2 outputs per timestep)
  • I will try to rewrite the beam search from its current depth-first to a breadth-first approach: iterating over a pool of candidates sorted by a score normalised by their length plus a prospective cost estimate (A* search), instead of iterating over timesteps strictly left to right. Judging from my LM and post-correction experience with beam search, this should be faster and better, and would also give finer control of the search effort (see the toy sketch below)
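To make the second idea concrete, here is a toy sketch of such a breadth-first decoder over per-timestep symbol probabilities (not Tesseract's actual data structures; the simple length normalisation stands in for the prospective-cost term of a real A* search):

```python
import heapq
import math

def breadth_first_decode(outputs, beam_width=16):
    """outputs: one dict {symbol: probability} per timestep.
    Candidates from all timesteps compete in a single priority queue,
    ordered by negative log-probability normalised by prefix length."""
    # entries: (priority, cumulative -log prob, timestep, prefix)
    heap = [(0.0, 0.0, 0, '')]
    while heap:
        _, nll, t, prefix = heapq.heappop(heap)
        if t == len(outputs):
            # first candidate to consume all timesteps wins
            return prefix, math.exp(-nll)
        for symbol, prob in sorted(outputs[t].items(),
                                   key=lambda kv: -kv[1])[:beam_width]:
            new_nll = nll - math.log(max(prob, 1e-10))
            heapq.heappush(heap, (new_nll / (t + 1), new_nll, t + 1,
                                  prefix + symbol))
    return '', 0.0

# e.g. the 'm'/'rn'/'in' confusion from the issue description:
print(breadth_first_decode([{'a': 0.9, 'o': 0.1},
                            {'m': 0.6, 'rn': 0.3, 'in': 0.1}]))
```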
