Duplicate Characters in Output Stream #2738

woodjohndavid · 2019-10-31T00:15:52Z

Please refer to the following link:

This concerns changes made to lstm_choices_mode.

Unless I misunderstand what these options are supposed to do, there appears to be a bug or oversight. Please refer to this user area thread:

https://groups.google.com/forum/#!topic/tesseract-ocr/5tC6appoUgE

There seems to be no way to prevent the LSTM engine from including duplicates in the generated text and/or HOCR output. The thread above contains a clear example of this.

Surely there must be some way to force Tesseract to include only the highest confidence level choice of character when there are multiple possibilities.

Also, apologies if this is posted in the wrong place, and apologies for possible duplicate postings. I am a Tesseract newbie, so I am still trying to learn the ropes.

Thanks.

bertsky · 2019-12-02T12:53:58Z

IMO it's perfectly legitimate to raise this issue again here. It has already surfaced several times under different names and descriptions, e.g. #1465. The usual recommendation is to improve the model quality. And this does of course help in reducing the likelihood of this happening. But nevertheless the underlying flaw (and you could also call it a bug) in the basic CTC implementation is still there. And it is more likely to surface when decoding less probable output segments (as happens with lstm_choice_iterations BTW).

I have (tentatively) termed the phenomenon of fake CTC duplicates diplopia, and recommended using Equal Spacing CTC or similar as a mitigation.

woodjohndavid · 2019-12-02T18:15:03Z

Thanks for the response Bertsky. Hopefully someone will take a look at trying to fix this issue. In the meanwhile, what I have done is, using the character level HOCR output, implemented a scan of that output to identify characters whose box dimensions overlap 'significantly' and then select only the highest confidence level character from those duplicates.

Another small question: could you please tell me where to post issues (not just questions) about Tesseract? Is the Google tesseract-dev group active? My posting there received no response. Is this Github Issues section the right place?

bertsky · 2019-12-02T18:58:07Z

> In the meanwhile, what I have done is, using the character level HOCR output, implemented a scan of that output to identify characters whose box dimensions overlap 'significantly' and then select only the highest confidence level character from those duplicates.

That's a very good workaround, and it would also work inside the beam decoder. It's only a question of finding the best parameter set (maximum confidence, minimum overlap absolute/relative) for different languages/scripts objectively (i.e. on large corpora)... But then again, if we had such a test system, we could quickly evaluate the impact of equal spacing CTC as well.
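The workaround discussed above can be sketched roughly as follows. This is a minimal illustration, not code from Tesseract: the `CharBox` type, the `overlap_ratio` measure (intersection area divided by the smaller box's area) and the `min_overlap=0.5` threshold are assumptions chosen for the example, and the sample boxes are invented.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """One character taken from character-level hOCR output (hypothetical container)."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int
    conf: float

    @property
    def area(self) -> int:
        return max(0, self.x1 - self.x0) * max(0, self.y1 - self.y0)


def overlap_ratio(a: CharBox, b: CharBox) -> float:
    """Intersection area relative to the smaller of the two boxes (assumed measure)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    smaller = min(a.area, b.area)
    if w <= 0 or h <= 0 or smaller == 0:
        return 0.0
    return (w * h) / smaller


def drop_overlapping_duplicates(chars, min_overlap=0.5):
    """Keep only the higher-confidence character when neighbours overlap strongly."""
    kept = []
    for ch in chars:
        if kept and overlap_ratio(kept[-1], ch) >= min_overlap:
            if ch.conf > kept[-1].conf:
                kept[-1] = ch  # replace the weaker duplicate
        else:
            kept.append(ch)
    return kept


# Invented example: 'O' and '0' occupy nearly the same position.
word = [
    CharBox("O", 10, 10, 30, 40, 60.0),
    CharBox("0", 12, 10, 31, 40, 90.0),
    CharBox("f", 35, 10, 45, 40, 95.0),
]
cleaned = "".join(c.text for c in drop_overlapping_duplicates(word))
print(cleaned)
```

Both the threshold and the overlap measure would need tuning per language or script, which is the parameter search mentioned above.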

> Another small question: could you please tell me where to post issues (not just questions) about Tesseract?

You are already in the right place for (possible) bugs and feature requests. As for mailing groups, I'm not qualified to answer that.

woodjohndavid · 2019-12-11T19:43:41Z

Hello again:

Well, it turns out that my workaround is not a good solution after all, as the character level box dimensions are not accurate in some cases. So this really needs to be promoted to being a bug of some kind, at least in so far as how the character level box dimensions are determined.

Attached is definitive proof of one case, although I have encountered many of them. This concerns the word "Cell" in the following sample image run through Tesseract. Attached are the following related files:

Sample Boxes Original.png - original image fed into Tesseract
Sample Boxes HOCR.hocr - full HOCR output from Tesseract for that image
Sample Box 1.png - screen shot taken of paint.net looking at box dimensions for letter 'C'
Sample Box 2.png - screen shot taken of paint.net looking at box dimensions for letter 'e'

Following is the snippet from the HOCR specific for the word "Cell" which is on its own near the center of the original image.

  <span class='ocrx_word' id='word_1_37' title='bbox 1094 604 1153 655; x_wconf 95'>
   <span class='ocrx_cinfo' title='x_bboxes 1094 611 1117 640; x_conf 99.545456'>C</span>
   <span class='ocrx_cinfo' title='x_bboxes 1107 604 1124 655; x_conf 99.56794'>e</span>
   <span class='ocrx_cinfo' title='x_bboxes 1118 617 1137 640; x_conf 99.500481'>l</span>
   <span class='ocrx_cinfo' title='x_bboxes 1139 608 1153 640; x_conf 99.421089'>l</span>
  </span>

If you examine this case, you will see that the box dimensions for the letters 'C' and 'e' overlap significantly, which is why my attempted workaround for removing duplicates dropped the letter 'C' from my output. However, if you actually look at the boxes on the source image (see my paint.net screen shots), you will see that the box for the letter 'e' simply makes no sense and cannot possibly be what Tesseract used to extract the letter 'e' with a confidence level of 99.56.
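To put numbers on the overlap in the snippet above, the following sketch parses the four `ocrx_cinfo` spans with a regular expression and reports the horizontal overlap between neighbouring boxes. The regex and the helper names are assumptions made for this example; a complete hOCR file would be better handled with a proper HTML parser.

```python
import re

# The four character spans for the word "Cell", copied from the hOCR snippet above.
HOCR = """
<span class='ocrx_cinfo' title='x_bboxes 1094 611 1117 640; x_conf 99.545456'>C</span>
<span class='ocrx_cinfo' title='x_bboxes 1107 604 1124 655; x_conf 99.56794'>e</span>
<span class='ocrx_cinfo' title='x_bboxes 1118 617 1137 640; x_conf 99.500481'>l</span>
<span class='ocrx_cinfo' title='x_bboxes 1139 608 1153 640; x_conf 99.421089'>l</span>
"""

CINFO = re.compile(r"x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.)</span>")


def parse_chars(hocr):
    """Return (text, x0, y0, x1, y1, conf) tuples in document order."""
    chars = []
    for x0, y0, x1, y1, conf, text in CINFO.findall(hocr):
        chars.append((text, int(x0), int(y0), int(x1), int(y1), float(conf)))
    return chars


def horizontal_overlap(a, b):
    """Number of pixel columns shared by two boxes (0 if they do not touch)."""
    return max(0, min(a[3], b[3]) - max(a[1], b[1]))


chars = parse_chars(HOCR)
pairs = list(zip(chars, chars[1:]))
overlaps = [horizontal_overlap(a, b) for a, b in pairs]
for (a, b), px in zip(pairs, overlaps):
    print(f"{a[0]}/{b[0]}: {px} px")
```

With these boxes, 'C' and 'e' share 10 pixel columns, which is more than half of the 17 px width reported for 'e'.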

I have encountered many such examples, a lot of them where the box dimensions used to correctly select a particular character cover an area which includes the previous or next character as well.

Sample Boxes HOCR.txt

bertsky · 2019-12-12T12:31:01Z

Thanks @woodjohndavid for providing details. I can confirm this with the current master. Here are all the boxes of that word:

That's clearly a bug.

Looking at the debug log with -c classify_debug_level=1, I see...

Processing word with lang eng at:Bounding box=(1094,530)->(1153,562)
Trying word using lang eng, oem 2
<null>=110 On [0, 2), scores= 100(i=83=0,00107) 99,9(C=1=0,0548), Mean=99,9364, max=99,9939
C=1 On [2, 6), scores= 92,9(<null>=110=7,04) 99,9(<null>=110=0,102) 0,401(<null>=110=99,6) 1,47e-05(<null>=110=99,8), Mean=48,3003, max=99,8814
e=90 On [6, 9), scores= 92,3(<null>=110=7,64) 99,9(<null>=110=0,0713) 12,9(<null>=110=87,1), Mean=68,3886, max=99,9144
l=87 On [9, 13), scores= 1,02(<null>=110=99) 97,7(<null>=110=2,32) 98(<null>=110=1,92) 2,64(<null>=110=97,4), Mean=49,8415, max=98,0281
l=87 On [13, 16), scores= 30,8(<null>=110=69,2) 99,9(|=59=0,0643) 1,06(<null>=110=98,9), Mean=43,8997, max=99,8603

...(from LSTMRecognizer::LabelsFromOutputs / DebugActivationPath), and its underlying pixel-wise sequence...

0 null_char score=-0,191388, c=-0,191388, perm=2, hash=0
1 null_char score=-0,385364, c=-0,193976, perm=2, hash=0 prev:null_char score=-0,191388, c=-0,191388, perm=2, hash=0
2 label=1, uid=3=C [43 ]A score=-0,577528, c=-0,192164, perm=2, hash=1 prev:null_char score=-0,385364, c=-0,193976, perm=2, hash=0
3 label=1, uid=3=C [43 ]A score=-0,771448, c=-0,19392, perm=2, hash=1 prev:label=1, uid=3=C [43 ]A score=-0,577528, c=-0,192164, perm=2, hash=1
4 label=1, uid=3=C [43 ]A score=-0,96271, c=-0,191262, perm=2, hash=1 prev:label=1, uid=3=C [43 ]A score=-0,771448, c=-0,19392, perm=2, hash=1
5 null_char score=-1,15898, c=-0,196274, perm=2, hash=1 prev:label=1, uid=3=C [43 ]A score=-0,96271, c=-0,191262, perm=2, hash=1
6 label=90, uid=92=e [65 ]a score=-1,3505, c=-0,191512, perm=2, hash=c9 prev:null_char score=-1,15898, c=-0,196274, perm=2, hash=1
7 label=90, uid=92=e [65 ]a score=-1,54367, c=-0,193177, perm=2, hash=c9 prev:label=90, uid=92=e [65 ]a score=-1,3505, c=-0,191512, perm=2, hash=c9
8 label=90, uid=92=e [65 ]a score=-1,73536, c=-0,191687, perm=2, hash=c9 prev:label=90, uid=92=e [65 ]a score=-1,54367, c=-0,193177, perm=2, hash=c9
9 label=87, uid=89=l [6c ]a score=-1,92683, c=-0,191467, perm=2, hash=577e prev:label=90, uid=92=e [65 ]a score=-1,73536, c=-0,191687, perm=2, hash=c9
10 label=87, uid=89=l [6c ]a score=-2,17104, c=-0,244217, perm=2, hash=577e prev:label=87, uid=89=l [6c ]a score=-1,92683, c=-0,191467, perm=2, hash=577e
11 label=87, uid=89=l [6c ]a score=-2,36344, c=-0,192399, perm=2, hash=577e prev:label=87, uid=89=l [6c ]a score=-2,17104, c=-0,244217, perm=2, hash=577e
12 null_char score=-2,61503, c=-0,251586, perm=2, hash=577e prev:label=87, uid=89=l [6c ]a score=-2,36344, c=-0,192399, perm=2, hash=577e
13 label=87, uid=89=l [6c ]a score=-2,80721, c=-0,192181, perm=2, hash=25eff9 prev:null_char score=-2,61503, c=-0,251586, perm=2, hash=577e
14 label=87, uid=89=l [6c ]a score=-3,0016, c=-0,194395, perm=2, hash=25eff9 prev:label=87, uid=89=l [6c ]a score=-2,80721, c=-0,192181, perm=2, hash=25eff9
15 label=87, uid=89=l [6c ]a score=-3,19285, c=-0,19125, perm=2, hash=25eff9 prev:label=87, uid=89=l [6c ]a score=-3,0016, c=-0,194395, perm=2, hash=25eff9

...(from RecodeBeamSearch::DebugPath) which looks fine. But it's not what we see above as segmentation, and that derives from Tesseract::SearchWords.
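For reference, the per-timestep path listed above collapses to the expected text under the usual CTC rule (merge consecutive repeats, then drop nulls). The sketch below applies that rule to labels transcribed by hand from the DebugPath listing. It only illustrates the rule and is not Tesseract's decoder; `ctc_collapse` is a made-up helper name.

```python
from itertools import groupby

NULL = None  # stand-in for the CTC null (blank) label

# Best-path label per timestep 0..15, transcribed from the DebugPath listing above.
path = [NULL, NULL, "C", "C", "C", NULL,
        "e", "e", "e", "l", "l", "l", NULL, "l", "l", "l"]


def ctc_collapse(labels):
    """Merge runs of identical labels and drop null runs.

    Returns (label, start, end) tuples, with end exclusive.
    """
    segments = []
    t = 0
    for label, run in groupby(labels):
        n = len(list(run))
        if label is not NULL:
            segments.append((label, t, t + n))
        t += n
    return segments


segments = ctc_collapse(path)
text = "".join(label for label, _, _ in segments)
print(text, segments)
```

The null at timestep 12 is what keeps the two 'l' runs apart, so the path yields four characters with non-overlapping timestep ranges.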

@stweil, do you think this could be related to your and Noah's fixes in #2576?

woodjohndavid · 2019-12-12T19:19:03Z

Thanks Bertsky for confirming the issue.

As a Tesseract newbie, could I impose upon you yet again to give me some idea of when and how bugs are prioritized and potentially worked on? Is there any Tesseract development activity actually underway at this point?

I understand fully that Tesseract is open source, and hence I have no basis for any expectations whatsoever. But I would like to understand what the current state of development activity is.

I doubt that I have the necessary technical skills to contribute to Tesseract development, but would be interested to know how one gets involved in that if one chooses.

Thanks in advance for whatever light you can shed on this for me.

bertsky · 2019-12-13T00:03:08Z

@woodjohndavid I can only give you my personal impression on the questions you just raised. This is obviously a diverse and open community; perspectives and circumstances of contributors/developers vary substantially.

What gets done how soon depends on many things, notably:

does the issue likely concern many others (and how severely)?
do potential contributors with knowledge of the relevant parts of the code currently have spare time or funding?

For current development efforts, cf. https://github.com/tesseract-ocr/tesseract/wiki/Planning.

If you want to contribute yourself,

First of all, read the documentation in the wiki.
Make sure you understand the contribution guidelines.
Apart from that, it's no more than forking, opening a PR and being patient :-)

woodjohndavid · 2019-12-13T00:46:01Z

OK thanks @bertsky, much appreciated. I realize also that this is not the right forum for these kinds of learning questions, but I have had little luck in getting anyone else to respond to them. So just one more, if you would be so kind: is there a leader or manager of the code base responsible for some kind of vetting of contributions before they enter the main code branch? If so, who?

Thanks again.

bertsky · 2019-12-13T07:55:43Z

There are people here with write permissions, but the reviewing work itself is usually shared. You can find out more by looking at the closed PRs or the contributor list.

woodjohndavid · 2020-01-21T00:02:13Z

Is there any likelihood that the issue of inaccurate character level bounding box dimensions will be addressed sometime soon? Of course, the real underlying issue is that the Tesseract LSTM engine is including multiple alternative characters in the output stream. However, it seems likely that the latter issue would be harder to correct. If the character level box dimensions could be made accurate, then the workaround that I proposed earlier in this thread for the duplicate character issue would in fact work.

RicketyRick · 2020-01-21T08:01:39Z

I second this. The wrong character-level boxes prevent us and our partner companies from using Tesseract, so we have to subscribe to the bad and expensive APIs of Abbyy and OmniPage. I would rather use Tesseract.

stweil · 2020-01-21T08:18:57Z

@woodjohndavid, @RicketyRick, the development process is currently entirely community driven. Code changes are provided by volunteers who might have other priorities than you.

So it is up to you to find and suggest a solution by providing a pull request - unless someone else does it.

RicketyRick · 2020-01-21T08:49:30Z

@stweil thank you, I will try, but the codebase is really big. Are there any pointers to the source files that are relevant to the bounding box issue?

woodjohndavid · 2020-01-21T19:07:47Z

There are numerous overlapping issues that have been raised related to this same subject. In perusing a few of them, the names that come up frequently include @Sintun @theraysmith @jbreiden @stweil @noahmetzger who seem to be knowledgeable in this area of functionality and code. Perhaps those gentlemen could give some direction on where to look in the code.

This seems to be directly related to #2576

ghost · 2020-03-19T14:27:07Z

Hi,

I believe I have the same issue with the following style of input tables:

Tesseract gives good enough results with -psm 6 (except it doesn't skip the divider bars of the table, so I have to clean up with sed to delete all [ | \ { and the such that it adds in the middle of the data...)
Surprisingly, if I run tesseract after first cleaning up the image to remove the table separators, the results are not as good, and tesseract mixes up Os and 0s, which it doesn't do if I leave the vertical bars.
In all cases though, I randomly get double characters (O0 or OQ etc.) when tesseract isn't sure which it is. If I run with hocr, all the random characters are associated with very low wconf.

While waiting for a fix, is there any way to teach tesseract the structure of each line ? All lines are the same and columns can only contain one type of data, digits, or characters...

Thank you very much and have a good day

bertsky · 2020-03-19T15:14:15Z

@clavelc, most of what you say is not related – please help keeping issues to the point!

> I have to clean up with sed to delete all [ | \ { and the such that it adds in the middle of the data...)

You can do that easier with a parameter: tesseract -c tessedit_char_blacklist="[|\\{" (or SetVariable() in API).
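When driving the CLI from a script, the same parameter can be passed as its own argument so that no shell quoting is involved. A minimal sketch: the file names are placeholders, and `tesseract_command` is a hypothetical helper rather than part of any Tesseract binding.

```python
import shlex


def tesseract_command(image, outbase, blacklist, psm=6):
    """Build an argv list for the tesseract CLI with a character blacklist."""
    return [
        "tesseract", image, outbase,
        "--psm", str(psm),
        "-c", f"tessedit_char_blacklist={blacklist}",
    ]


# "[|\\{" in Python source is the four characters [ | \ {
cmd = tesseract_command("table.png", "table", "[|\\{")
print(shlex.join(cmd))
# subprocess.run(cmd, check=True) would execute it where tesseract is installed.
```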

> Surprisingly, if I run tesseract after first cleaning up the image to remove the table separators, the results are not as good

Yes, your columns are very close to each other, so the lines should help.

> While waiting for a fix, is there any way to teach tesseract the structure of each line ? All lines are the same and columns can only contain one type of data, digits, or characters...

Yes you can: for this kind of table, you can easily use the --user-patterns feature of the CLI (see man-page and wiki). For more complicated cases, you can always do segmentation separately, then crop segment images, and run in PSM_SINGLE_LINE with very strict user patterns or char whitelist.

ghost · 2020-03-19T16:58:11Z

Thanks for your answer,

> please help keeping issues to the point!

Sorry for that, will do !

> you can easily use the --user-patterns feature of the CLI

Thanks for the tip, I looked up --user-patterns the other day but couldn't figure out how to apply it to my table. I'll try again.

Have a good day

woodjohndavid · 2021-06-01T22:09:22Z

I have downloaded the latest master code branch version and am experimenting with the code under Ubuntu on two fronts:

The original purpose of this thread, which is the inclusion of multiple characters in the output stream for what is essentially the same character position in the incoming image. Interestingly enough, the current version from master is somewhat improved in this regard: some samples of this problem that I collected earlier with the Windows build tesseract-ocr-w64-setup-v5.0.0-alpha.20191030 now work. However, I have searched the pull request log and see nothing there that seems related to correcting this issue, and some of my test cases still demonstrate the problem. See the attachment for one small sample, for which the latest master Tesseract comes up with '10of3'. I intend to work with the fix suggested in issue #3144 (Character confusion fix suggestion) to see if that may be a path forward. I could be wrong, but progress on that thread seems to have stalled.
The secondary purpose which this thread morphed into, which is the issue with the inaccuracy of the character level box dimensions in the HOCR output when using the LSTM engine. I had attempted a workaround for the multiple character problem by using those box dimensions to identify characters with more-or-less the same image position, and to try and select the one with the highest confidence level. However, this workaround turns out to be a non-starter since the box dimensions cannot be relied upon. I have done some investigation on this subject, and will create a separate issue dedicated to that problem. However, this is largely a red herring in my situation, since the only reason I cared about it was if it could be used to solve the multiple overlapping character problem.

woodjohndavid · 2024-03-13T22:41:21Z

I have just created pull request #4211 which I consider to be an improved solution for diplopia.

I encourage everyone on this thread to try this out and test it with as broad a range of cases as possible.

Note by the way, there are some new configuration values that can only be set in code as things stand. These configuration values are:

bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect
int kMaxDiplopiaGap - maximum number of timesteps apart to be considered diplopia, default 2

Obviously if my diplopia change is of value, then these configuration items should be made into settings.

bertsky mentioned this issue Dec 2, 2019: Lstm choice ril #2635 (Merged)
stweil added the accuracy label Dec 2, 2019
This was referenced Dec 17, 2019: Invented Characters Included in Output Stream #2824 (Closed); Overlapping Character Boundingboxes #2825 (Open)
stweil mentioned this issue Oct 30, 2020: Character confusion fix suggestion #3144 (Open)
bertsky mentioned this issue Mar 16, 2021: Tesseract inserting additional alternative characters #1465 (Open)
amitdo added the diplopia label Mar 17, 2021
woodjohndavid mentioned this issue Jun 29, 2021: LSTM Engine Diplopia Issue and Inaccurate HOCR Character Level Box Dimensions #3477 (Open)
stweil added this to the 6.0.0 milestone Aug 18, 2021
