also fill PAGE's glyphs and their variants and confidences via GetIterator() in recognize.py #7

Closed · bertsky opened this issue Apr 20, 2018 · 9 comments · Fixed by #110

@bertsky (Collaborator) commented Apr 20, 2018

For OCR post-correction, TextLine.Word.Glyph.TextEquiv can be more valuable than just TextLine.TextEquiv. It allows building a lattice (or rather, a confusion network) of alternative character hypotheses from which to (re)build words and phrases. The PAGE notion of character hypotheses is glyph variants, i.e. a sequence of TextEquiv elements with index and conf (confidence) attributes. This does not help with segmentation ambiguity (especially on the word level, since PAGE enforces a hierarchy of Word elements), but most ambiguity on the character level can still be captured.

Example:

```xml
<TextLine id="...">
  <Coords points="..."/>
  <Word id="...">
    <Coords points="..."/>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv>
        <Unicode>a</Unicode>
      </TextEquiv>
    </Glyph>
    <Glyph id="...">
      <Coords points="..."/>
      <TextEquiv index="0" conf="0.6">
        <Unicode>m</Unicode>
      </TextEquiv>
      <TextEquiv index="1" conf="0.3">
        <Unicode>rn</Unicode>
      </TextEquiv>
      <TextEquiv index="2" conf="0.1">
        <Unicode>in</Unicode>
      </TextEquiv>
    </Glyph>
  </Word>
  <Word id="...">
    ...
  </Word>
  <TextEquiv>
    <Unicode>am Ende</Unicode>
  </TextEquiv>
</TextLine>
```

So this part of the wrapper should also dive into the word and character/glyph substructure as a complementary level of annotation. Tesseract's API seems straightforward for this use case: baseapi.h contains GetIterator(), which gives a ResultIterator that can recurse down to RIL_SYMBOL as its PageIteratorLevel. For each glyph, GetUTF8Text() and Confidence() then yield what we need.
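A minimal sketch of that traversal through the tesserocr bindings might look like this (the image path is a placeholder; GetChoiceIterator wraps LTRResultIterator::GetChoiceIterator and would expose the glyph variants):

```python
from tesserocr import PyTessBaseAPI, RIL, iterate_level

with PyTessBaseAPI() as api:
    api.SetImageFile('page.png')  # placeholder input image
    api.Recognize()
    iterator = api.GetIterator()
    for glyph in iterate_level(iterator, RIL.SYMBOL):
        # best reading and its confidence (percent) for this glyph
        text = glyph.GetUTF8Text(RIL.SYMBOL)
        conf = glyph.Confidence(RIL.SYMBOL)
        # alternative readings, i.e. glyph variants in PAGE terms
        for choice in glyph.GetChoiceIterator():
            alt_text = choice.GetUTF8Text()
            alt_conf = choice.Confidence()
```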

@kba (Member) commented Apr 26, 2018

Thanks for your feedback, much appreciated. A short update on progress: we laid the groundwork last week to make this possible by making the PAGE library in OCR-D/core more flexible. Now we are discussing how to parameterize segmentation/recognition so that the Tesseract API can be used on different levels with different langdata etc. Once that is settled, recognizing on the glyph level and keeping confidences should be straightforward to implement.

@bertsky (Collaborator, Author) commented Aug 5, 2018

Implemented a prototype. This does add glyph annotation with confidences, but alternative hypotheses are only available with Tesseract 3 models (while LSTM models just give the first reading).
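For reference, annotating such results through the OCR-D PAGE API might look roughly like this (a sketch against the generated ocrd_models.ocrd_page classes; the id, coordinates and the existing `word` object are placeholders):

```python
from ocrd_models.ocrd_page import GlyphType, TextEquivType, CoordsType

# attach a glyph with two alternative readings to an existing Word object
glyph = GlyphType(id='word1_glyph2',  # placeholder id
                  Coords=CoordsType(points='0,0 10,0 10,20 0,20'))
glyph.add_TextEquiv(TextEquivType(Unicode='m', index=0, conf=0.6))
glyph.add_TextEquiv(TextEquivType(Unicode='rn', index=1, conf=0.3))
word.add_Glyph(glyph)  # `word` is an existing WordType instance
```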

Not sure if that change can help.

@bertsky (Collaborator, Author) commented Oct 25, 2018

> Not sure if that change can help.

This and the related tesseract-ocr/tesseract#1851 and tesseract-ocr/tesseract#1997 do not in fact serve the old API (LTRResultIterator::GetChoiceIterator), but merely introduce a new function (TessBaseAPI::GetBestLSTMChoices), which will become available with release 4.0. So either we adapt to that, or we wait for the old API to be fixed as well.

@kba (Member) commented Nov 13, 2018

@noahmetzger @bertsky What's the status here?

@bertsky (Collaborator, Author) commented Nov 13, 2018

Still the same as 3 weeks ago AFAICT. GetBestLSTMChoices is a good start (especially for independent experiments), but I still hesitate to adapt to it here: I would like to keep both old (pre-LSTM) and new models running and producing consistent results. If we could at least query the API about old vs. new, then we could attempt a different backend for each, but I do not see any way to do so. (There is get_languages, giving the names of the models, and the OEM class, corresponding to the TessOcrEngineMode enum, but nothing that says "this language is that mode".)
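To illustrate what the API does (and does not) tell us, a small sketch with tesserocr:

```python
import tesserocr
from tesserocr import PyTessBaseAPI, OEM

# lists the tessdata path and the installed model names --
# but nothing reveals whether a given model is legacy or LSTM
print(tesserocr.get_languages())

# the engine mode can only be chosen globally, per API instance:
with PyTessBaseAPI(lang='eng', oem=OEM.TESSERACT_ONLY) as api:  # legacy engine
    pass
with PyTessBaseAPI(lang='eng', oem=OEM.LSTM_ONLY) as api:  # LSTM engine
    pass
```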

What is worse, we cannot go further as long as tesserocr does not migrate to the 4.0 codebase. (It does not currently build at all.) See here for Noah's PR, which also needs to be updated.

@noahmetzger (Contributor) commented

I will update the PR this week or next, but we are currently working on another part of the project.

@wrznr (Contributor) commented Apr 17, 2019

@noahmetzger @bertsky What's the status here?

@stweil (Contributor) commented Apr 17, 2019

@noahmetzger can be reached again after Easter Monday. His main focus is currently implementing the requirements for @bertsky in Tesseract.

@bertsky (Collaborator, Author) commented Apr 17, 2019

Noah did make changes that populate the old iterator API on the LSTM engine, and they have been merged already. But as I argued elsewhere, this cannot produce correct scores (for anything besides the best path) and may produce illegal characters (because it does not respect the incremental character encoding of the beam).

Also, when trying lattice output, character whitelisting and user patterns/words I observed that the current beam search is too narrow anyway.

So we are currently working on two possible solutions:

  • Noah will try to find a way to correctly re-integrate partial hypotheses (which fell off the beam), and to widen the overly restrictive beam entrance (which currently only admits the top 2 outputs per timestep)
  • I will try to rewrite the beam search from its current depth-first to a breadth-first approach: iterating over a pool of candidates sorted by a score normalised by their length plus a prospective cost estimate (A* search), instead of iterating over timesteps strictly left to right. Judging from my LM and post-correction experience with beam search, this should be faster and better, and would also give finer control of the search effort (see the toy sketch below)
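To make the second idea concrete, here is a toy sketch of such a breadth-first decoder over per-timestep symbol probabilities (not Tesseract's actual data structures; the simple length normalisation stands in for the prospective-cost term of a real A* search):

```python
import heapq
import math

def breadth_first_decode(outputs, beam_width=16):
    """outputs: one dict {symbol: probability} per timestep.
    Candidates from all timesteps compete in a single priority queue,
    ordered by negative log-probability normalised by prefix length."""
    # entries: (priority, cumulative -log prob, timestep, prefix)
    heap = [(0.0, 0.0, 0, '')]
    while heap:
        _, nll, t, prefix = heapq.heappop(heap)
        if t == len(outputs):
            # first candidate to consume all timesteps wins
            return prefix, math.exp(-nll)
        for symbol, prob in sorted(outputs[t].items(),
                                   key=lambda kv: -kv[1])[:beam_width]:
            new_nll = nll - math.log(max(prob, 1e-10))
            heapq.heappush(heap, (new_nll / (t + 1), new_nll, t + 1,
                                  prefix + symbol))
    return '', 0.0

# e.g. the 'm'/'rn'/'in' confusion from the issue description:
print(breadth_first_decode([{'a': 0.9, 'o': 0.1},
                            {'m': 0.6, 'rn': 0.3, 'in': 0.1}]))
```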
