RFC: Remove the legacy OCR Engine #707

Closed
amitdo opened this issue Feb 7, 2017 · 106 comments

@amitdo (Collaborator) commented Feb 7, 2017

Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.

From #518:

@stweil commented:

I strongly vote against removing non-LSTM as we currently still get better results with it in some cases.

@theraysmith commented:

Please provide examples of where you get better results with the old engine.
Right now I'm trying to work on getting rid of redundant code, rather than spending time fighting needless changes that generate a lot of work. I have recently tested an LSTM-based OSD, and it works a lot better than the old, so that is one more use of the old classifier that can go. AFAICT, apart from the equation detector, the old classifier is now redundant.

@Shreeshrii (Collaborator) commented:

Is the intention that the legacy OCR engine will be available in the 3.0x branch and the LSTM engine in the 4.0 version?

@harinath141 commented:

I support @theraysmith in removing the legacy OCR engine, as we are getting better results with the LSTM-based one; however, we have to improve multi-language support, and many fixes are needed before the 4.0 final.

@amitdo (Collaborator, Author) commented Feb 8, 2017

My personal opinion is that we should drop the old engine. It will be much easier to maintain and support Tesseract in this form. I also support dropping the OpenCL code.

@amitdo (Collaborator, Author) commented Feb 8, 2017

I also think we should release a final 3.0x version in the coming 2-6 weeks.

@egorpugin (Contributor) commented Feb 8, 2017

+1 for dropping (provided the LSTM engine gives better results).

@atuyosi (Contributor) commented Feb 8, 2017

I cannot agree with removing the old OCR engine until the new LSTM engine supports vertical text.

Of course, I know the new LSTM engine is very good (especially for Japanese text that includes English words). In the meantime, maintaining the old engine provides the option of using it just for vertical text.

cf. #627, #641

@theraysmith (Contributor) commented Feb 8, 2017 via email

@zdenop (Contributor) commented Feb 8, 2017

If 3.05 is to be the last version with the legacy OCR engine, then there should be a possibility to read the OCR result from memory.

Also, it would be great if the 3.05 and 4.0 versions could be installed at the same time (AFAIK there is a conflict with tessdata filenames: they are the same, but the files are not compatible).

@amitdo (Collaborator, Author) commented Feb 8, 2017

👍 for a side-by-side 3.05 and 4.00.

A possible way to achieve this goal:
For 3.05, you can append 3 to libtesseract and to all the installed programs.
The traineddata will live in .../share/tessdata3.
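
For illustration, that scheme might result in a layout like this (hypothetical paths, not an agreed convention):

.../bin/tesseract3 and .../lib/libtesseract3 (3.05)
.../share/tessdata3/eng.traineddata (3.05)
.../share/tessdata/eng.traineddata (4.00)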

@zdenop (Contributor) commented Feb 8, 2017

I would prefer to be as consistent as possible: e.g., since 3.02 and 3.04 use tessdata, 3.05 should too. So the change should start with 4.0...

@egorpugin (Contributor) commented:

Yes, if we later have 5.0 with different data files, it will use tesseract5, and this won't break anything.
But if plain tesseract means 4.0, it would have to be renamed to tesseract4 again once tesseract means 5.0 - that's not good.

@Shreeshrii (Collaborator) commented Feb 8, 2017 via email

@stweil (Contributor) commented Feb 8, 2017

A simple solution could be using tessdata/4, tessdata/5 and so on for new major versions, so we continue using a tessdata directory at the same location as before, but automatically add the major version as the name of a subdirectory. If Tesseract uses semantic versioning in the future, I see no need to add a second number (although that would be possible, resulting in tessdata/4.0).

For the program names, we can look for existing examples. I just checked my /usr/bin/*[0-9] files and found names like clang-3.8, gcc-6, php5, php-7.0, ruby2.1. So there is no clear convention whether to separate name and version by a dash or not and whether to use major version only or both major and minor version. With semantic versioning the major version should be sufficient again.
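
For illustration, the subdirectory scheme would give hypothetical paths like:

/usr/share/tessdata/4/eng.traineddata
/usr/share/tessdata/5/eng.traineddata

combined with program names such as tesseract-4 or tesseract4, whichever convention is chosen.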

@theraysmith (Contributor) commented Feb 8, 2017 via email

@jbreiden (Contributor) commented Feb 10, 2017

https://wiki.ubuntu.com/ZestyZapus/ReleaseSchedule

Feb 16 is the final deadline for changes to Ubuntu 17.04. I am not comfortable shipping anything from 4.x to these users, but we can consider taking a snapshot of the 3.0.5 branch. It does have some bug and compatibility fixes that are good for users. Regarding training data, I would not ship an update to that at all; this would purely be a code update.

I know the long-standing issue has been restoring an API call (last seen in version 3.0.2) to send results to memory instead of to a file. I respect that idea, but we don't have it, and it's not that easy to add. I think it is fair to say that it would be impossible before the deadline. So the question is: do we ship an update to users this cycle or not? And if so, should I take a snapshot? And if so, what should it be called?

A few more thoughts that are somewhat related:

  • I see no reason that this has to be the last ever release on the 3.0.x branch.
  • My guess is that by the next release, in Oct 2017, 4.x will be ready for the vast majority of users.
  • I'm not planning to ship both 3.0.x and 4.x at the same time with Debian/Ubuntu. I think it will be very rare for people to want both, and those who do will be advanced users who can work from source code.

@Shreeshrii (Collaborator) commented:

@jbreiden Good idea to do a code update for 3.05 for Ubuntu 17.04. There are a number of bug fixes and changes and it would be good to get them out to the users. Thanks!

@uvius commented Feb 15, 2017

@theraysmith: Here I try to provide examples of where you get better results with the old engine.

I did a lot of LSTM training with OCRopus on real images of historical printings and noticed that LSTM recognition was inferior to classic Tesseract in these cases:

  1. glyphs rarely seen in training (capital letters, numbers, certain punctuations)
  2. unusual patterns (letter-spacing, e.g. R U N N I N G H E A D)
  3. very short lines (catchword at page end, page numbers)

My explanation is that single letters get decoded using the combined evidence of the whole line. If that evidence is either rare and unusual (1, 2) or mostly absent (3), decoding is uncertain, no matter how clearly the individual glyphs are printed and preserved (and therefore how easily they are recognized by feature-based methods).

So I tried both the old (OEM = 0) and the new (OEM = 1) recognizer on these 10 lines (the last line, included for comparison, is a regular text line from a 1543 printing for which a trained model yields 99.5% accuracy on the book):

[10 attached line images omitted]

Old method: tesseract -l lat --oem 0 --psm 7:

17:
V.
SECVNDAE
B 3
LIBER
AD
Lxxxvxn.
zo PROGYMNASMATA
IN GENEROSVM ADOLESCEN-
cafiris millia paITuum circitér fcptem.Rc_x cum hoc itincrc szaré ucnirc

New method: tesseract -l lat --oem 1 --psm 7:

177:
V,.
SECV NDAHE
B- 5
LI B E D.
A D
Lx x XV II IL.
209 P R o cy M N ^ s M A T 4
IN GE NE R O SVM A D O L E S CE N-
caüris millia paiTuum circiter fcptcm.Rc-x cum hoc itinere Cæfarö uenit:

Admittedly, although this is all Latin text, the recognition looks much better without any language model (tesseract --oem 1 --psm 7):

17;
V.
SECV NDA E
B ;
LIB E R
A D
Lx X xv III.
40 PR 0 GY MN A S M A T a
IN GENEROSVM ADOLESCEN.
caftris millia pafluum circiter feptem. Rex cum hocitinere Cafaré uenire

But it is still less consistent than the old method in its treatment of spacing. The last line shows the potential that may be reached once training on real images becomes available (long ſ, a proper inter-word spacing model, historical glyphs).

So I vote for keeping the old code just for these edge cases, which are otherwise hard to recognize at the same level of consistency.

@amitdo (Collaborator, Author) commented Feb 17, 2017

Admittedly, although this is all Latin text, the recognition looks much better without any language model

Without an explicit -l LANG, Tesseract will use the eng traineddata, so

tesseract --oem 1 --psm 7

is equivalent to:

tesseract -l eng --oem 1 --psm 7

@amitdo (Collaborator, Author) commented Feb 21, 2017

@theraysmith commented in commit b453f74

There is always going to be a significant speed penalty for multi-lang mode.
The multi-lang mode could still do with more work to run it at a lower level (inside RecognizeLine), but the legacy engine could go before that, or multi-lang could get really unnecessarily complex.

@solomennikm (Contributor) commented:

I think there is reason to keep the old OCR engine while the LSTM engine is not yet ideal.
This would allow using the two engines simultaneously.
For example, ABBYY uses several OCR methods in its engine: a Bayesian classifier with about 100 features, a raster classifier, a contour classifier, a structure classifier, and then differentiating classifiers.

@amitdo (Collaborator, Author) commented Mar 1, 2017

The problem is that the code for the old engine is too large and complex. As Ray indicated, keeping it will make improving the new LSTM engine much harder.

@zdenop (Contributor) commented Mar 1, 2017

So there should be changes in the 4.0 code so that Tesseract 4.x and 3.05.x can be installed at the same time.

@egorpugin (Contributor) commented:

Yes, whoever wants the old engine can use the 3.x series once LSTM is available in 4.x.

@Shreeshrii (Collaborator) commented:

#733: single letters are recognized better with the legacy engine.

@amitdo (Collaborator, Author) commented Mar 8, 2017

From #744
theraysmith commented:

... yes I would still like to remove the old classifier and take out a lot of code with it.
I'm going to review the replies to my request for "old better than new", and thanks to those that provided them, with a view to making new better than old on those problems.

@amitdo (Collaborator, Author) commented Mar 8, 2017

From #518
theraysmith commented:

Please provide examples of where you get better results with the old engine.

@stweil commented 29 days ago:

I'll do that in the discussion of the new issue #707.

Stefan, we are still waiting for it ... :-)

@stweil (Contributor) commented Mar 8, 2017

I can confirm all the problems reported above by @uvius. In addition, some training files currently only exist for 3.x (notably deu_frak) or have bad quality (deu), so 4.0 does not improve the results for those languages.

I also had an example where a larger part of a page was missing in the LSTM output while the old recognizer got most of that part right, but I am still searching for that example.

@amitdo (Collaborator, Author) commented Mar 8, 2017

As you know, unlike almost all the other files in the tessdata repo, the '_frak' traineddata files are not based on Google manpower (and machine-power) efforts.
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#fraktur-data-files

Maybe you and your friends from @UB-Mannheim can prepare a new deu_frak traineddata for Tesseract 4.00 and share it under an open-source license (preferably Apache 2.0 or another permissive license)?

@amitdo (Collaborator, Author) commented Jul 18, 2018

#1074 (comment)

@amitdo (Collaborator, Author) commented Jul 18, 2018

Font size estimation is already supported by the LSTM engine.

@makmanalp commented:

@amitdo Er, I should have excluded size from that; I meant style.

@amitdo (Collaborator, Author) commented Jul 18, 2018

Did you notice my other comment above (a link to Ray's comment)?

@amitdo (Collaborator, Author) commented Sep 19, 2018

So it seems that the legacy engine will stay in the final 4.0.0.

I still have a question:

Has anyone done serious testing on hundreds of pages to compare the results of:

  1. legacy vs. LSTM
  2. LSTM+legacy vs. LSTM alone

using just the text renderer with the default PSM (auto, no OSD)?

I'm interested in character and word error rate (CER & WER) statistics.
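
For anyone running such a comparison, here is a minimal, illustrative sketch of how CER and WER can be computed (plain edit distance divided by ground-truth length; dedicated evaluation tools such as the ISRI/ocreval accuracy tools additionally handle Unicode normalization and detailed alignment):

def edit_distance(a, b):
    # Levenshtein distance via dynamic programming; works on strings
    # (character level) and on lists of words (word level) alike.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                           # deletion
                         cur[j - 1] + 1,                        # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[len(b)]

def cer(ocr, gt):
    # Character error rate: edits relative to ground-truth length.
    return edit_distance(ocr, gt) / max(1, len(gt))

def wer(ocr, gt):
    # Word error rate: the same distance over word sequences.
    return edit_distance(ocr.split(), gt.split()) / max(1, len(gt.split()))

# Example from the samples above: one inserted character in 13 ground-truth characters.
print(cer("PROGYVMNASMATA", "PROGYMNASMATA"))  # 1/13, i.e. about 0.077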

@stweil (Contributor) commented Sep 19, 2018

I am just running accuracy tests on 189 pages from our historic books. Currently I have results from ABBYY FineReader and from Tesseract with fast Fraktur and PSM 1, 3 and 6. I also tested ScanTailor + Tesseract with fast Fraktur and PSM 1. The test with best Fraktur is still running. First results for the median CER:

ABBYY: 10.5 %
PSM 1: 13.0 %
PSM 3: 15.0 %
PSM 6: 20.2 %
ScanTailor + PSM 1: 12.8 %

Detailed results will be published as soon as the tests are finished.

@stweil (Contributor) commented Sep 19, 2018

lstm+legacy is currently not usable for mass production, because chances are high that Tesseract will fail due to the well-known assertion error.

@amitdo (Collaborator, Author) commented Sep 19, 2018

Thanks for sharing!

What preprocessing options are used with ScanTailor?

I think ABBYY also has 'fast' and 'accurate' modes.

@stweil (Contributor) commented Sep 19, 2018

ScanTailor was used with scantailor-cli --color-mode=mixed --dewarping=auto.

ABBYY used the default mode with different language settings. I still have to look for effects caused by different handling of diacritics (for example: Latin ground truth and ABBYY result without accents, but the original text uses accents => Tesseract Fraktur detects accents).

The raw data is at https://digi.bib.uni-mannheim.de/~stweil/anciendroit/new/.

@amitdo changed the title from "Removing the legacy OCR Engine" to "Removing the legacy OCR Engine [at version 5.0.0 or higher]" on Oct 3, 2018
@amitdo changed the title from "Removing the legacy OCR Engine [at version 5.0.0 or higher]" to "[RFC] Remove the legacy OCR Engine [at version 5.0.0 or higher]" on Oct 3, 2018
@amitdo (Collaborator, Author) commented Oct 14, 2018

I decided to close this issue.

There is now an option to compile Tesseract 4.0.0 without the legacy engine code.

@amitdo closed this as completed on Oct 14, 2018
@amitdo changed the title from "[RFC] Remove the legacy OCR Engine [at version 5.0.0 or higher]" to "RFC: Remove the legacy OCR Engine" on May 16, 2020
@amitdo (Collaborator, Author) commented Oct 10, 2020

I succeeded in dropping the legacy engine, removing ~37K LOC.

It's more than 64K LOC now.

@amitdo (Collaborator, Author) commented Apr 20, 2021

@stweil

Can you run the accuracy tests again on the same dataset with master and/or latest tagged version, to make sure there is no regression?

@stweil (Contributor) commented Apr 20, 2021

Here are the results with the latest Tesseract for the line images posted by @uvius. They are still not perfect, but much better than the old ones, especially with a new model which I have recently trained (based on ground truth published by @uvius and others).

--oem 1 --psm 7 -l tessdata/lat

17 :
V.
SECVNDAE
B 3
LIBER
AD
LxxxvIIL
20 PROGYMNASMATA
IN GENEROSVM ADOLESCEN-
caftris millia pafluum circiter feptem. R ex cum hoc itinere Cafíaré uenire

--oem 1 --psm 7 -l ubma/frak2021_0.905_1587027_9141630

17
V.
SECVNDAE
B 3
LIBER
AD
LXXXVIII.
20 PROGYVMNASMATA
IN GENEROSVM ADOLES CEN-
caſtris millia paſſuum circitèr ſeptem. Rex cum hoc itinere Cæſarẽ uenire

@stweil (Contributor) commented Apr 21, 2021

Can you run the accuracy tests again on the same dataset with master and/or latest tagged version, to make sure there is no regression?

tesseract-5.0.0-alpha-592-gb221f produces only a few different results on the 189 files of my test set, but some of those are significantly better (- = old, + = new):

-452114306_0024  85.67%  Accuracy
+452114306_0024  85.57%  Accuracy
-452117542_0022  20.94%  Accuracy
+452117542_0022  87.72%  Accuracy
-461732149_0012  89.80%  Accuracy
+461732149_0012  89.44%  Accuracy
-461732149_0158  15.09%  Accuracy
+461732149_0158  81.48%  Accuracy
-470857285_0979  93.70%  Accuracy
+470857285_0979  93.77%  Accuracy
-470875348_0608  89.93%  Accuracy
+470875348_0608  89.97%  Accuracy
-470901101_0034  86.81%  Accuracy
+470901101_0034  86.86%  Accuracy

Git master produces different results: some of them are slightly worse, some are better. The most significant change with the latest Tesseract is the time required to process the 189 pages, which dropped from 1638 s to 926 s. I think that both effects are caused by commits cfb1fb2 and eaf72ac.

@stweil (Contributor) commented Apr 21, 2021

With our latest model file and current Tesseract, the median CER is reduced to 9 %. The execution time is now 635 s. That's more than twice as fast as the old 4.0 results, with much better accuracy.

The bad news is that the latest Tesseract does not detect any text on two of the 189 pages. That requires a closer examination.

5.0.0-alpha-20201224 was still fine.

@wollmers commented:

@stweil

The bad news is that the latest Tesseract does not detect any text on two of the 189 pages.

What do these pages look like?

I guess 470875348_0010.txt is a title page, as page 0012 is the preface. Then it's the "title page problem": perhaps large letters in a special design, extreme letterspacing, etc. Results get better if I cut title pages into line images and the font style is similar to something that was trained.

Title pages seem to be underrepresented in training sets. It's a sort of selection bias.

@stweil (Contributor) commented Apr 21, 2021

What do these pages look like?

See the original JPEG files 470875348_0010 and 452117542_0250.

@wollmers commented:

What do these pages look like?

See the original JPEG files 470875348_0010 and 452117542_0250.

Tried it with

$ tesseract --version
tesseract 5.0.0-alpha-773-gd33ed
 leptonica-1.79.0

$ tesseract 452117542_0250.jpg 452117542_0250.GT4 -l GT4Hist2M \
    -c tessedit_write_images=true --tessdata-dir /usr/local/share/tessdata makebox hocr txt

$ tesseract 452117542_0250.jpg 452117542_0250.frak -l ubma/frak2021_0.905_1587027_9141630 \
    -c tessedit_write_images=true --tessdata-dir /usr/local/share/tessdata makebox hocr txt

and get for GT4Hist2M

              lines   words   chars
items ocr:       58     204    1047 matches + inserts + substitutions
items grt:       56     198    1039 matches + deletions + substitutions
matches:         36     165    1000 matches
edits:           22      39      55 inserts + deletions + substitutions
 subss:          20      33      31 substitutions
 inserts:         2       6      16 inserts
 deletions:       0       0       8 deletions
precision:   0.6207  0.8088  0.9551 matches / (matches + substitutions + inserts)
recall:      0.6429  0.8333  0.9625 matches / (matches + substitutions + deletions)
accuracy:    0.6207  0.8088  0.9479 matches / (matches + substitutions + inserts + deletions)
f-score:     0.6316  0.8209  0.9588 ( 2 * recall * precision ) / ( recall + precision )
error:       0.3929  0.1970  0.0529 ( inserts + deletions + substitutions ) / (items grt )

and for ubma/frak2021_0.905_1587027_9141630

              lines   words   chars
items ocr:       58     202    1052 matches + inserts + substitutions
items grt:       56     198    1039 matches + deletions + substitutions
matches:         39     173    1014 matches
edits:           19      29      43 inserts + deletions + substitutions
 subss:          17      25      20 substitutions
 inserts:         2       4      18 inserts
 deletions:       0       0       5 deletions
precision:   0.6724  0.8564  0.9639 matches / (matches + substitutions + inserts)
recall:      0.6964  0.8737  0.9759 matches / (matches + substitutions + deletions)
accuracy:    0.6724  0.8564  0.9593 matches / (matches + substitutions + inserts + deletions)
f-score:     0.6842  0.8650  0.9699 ( 2 * recall * precision ) / ( recall + precision )
error:       0.3393  0.1465  0.0414 ( inserts + deletions + substitutions ) / (items grt )
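
As a quick sanity check, the character-level numbers in the second table can be reproduced from the raw counts with the formulas shown in the rightmost column (a minimal sketch; the counts are copied from the table above):

# Character column of the frak2021 table:
matches, subs, inserts, deletions = 1014, 20, 18, 5
grt = matches + subs + deletions                             # 1039 ground-truth chars
precision = matches / (matches + subs + inserts)             # 0.9639
recall = matches / (matches + subs + deletions)              # 0.9759
accuracy = matches / (matches + subs + inserts + deletions)  # 0.9593
f_score = 2 * recall * precision / (recall + precision)      # 0.9699
error = (subs + inserts + deletions) / grt                   # 0.0414
print(precision, recall, accuracy, f_score, error)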

Page 470875348_0010.jpg also looks fine, but I didn't spend the time to create a GT.txt for it.

Preprocessing issue?

The remaining CER of ~4 % is still high given the good image quality and a very common typeface (Garamond-like).

@stweil (Contributor) commented Apr 21, 2021

The regression (empty page) was introduced by commit 5db92b2, in particular by the modified declaration of PartSetVector.

Our GT data is available online:
https://digi.bib.uni-mannheim.de/fileadmin/digi/452117542/gt/
https://digi.bib.uni-mannheim.de/fileadmin/digi/470875348/gt/

Tesseract output includes long s and some other characters which must be normalized before the comparison with that ground truth.
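
A minimal normalization sketch in Python (illustrative only; the folding table below is a hypothetical example, and the exact mapping depends on the transcription guidelines used for the ground truth):

import unicodedata

# Hypothetical folding table: map historic glyphs to modern equivalents
# so that OCR output and ground truth can be compared on equal footing.
FOLD = {"ſ": "s", "æ": "ae", "Æ": "AE"}

def normalize(text):
    text = unicodedata.normalize("NFC", text)  # compose combining accents consistently
    for old, new in FOLD.items():
        text = text.replace(old, new)
    return text

print(normalize("caſtris millia paſſuum"))  # -> "castris millia passuum"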

@stweil (Contributor) commented Apr 21, 2021

The remaining CER of ~4 % is still high given the good image quality and a very common typeface (Garamond-like).

That's not surprising, because our training data had no focus on such typefaces. It used GT4HistOCR (a mix from early prints up to the 19th century), Austrian newspapers (mainly Fraktur), and German primers from the 19th century (Fraktur and handwritten script).

Another, and maybe even more dominant, reason is the current quality of the ground truth texts. They were produced by a single person (no second proofreader), and I just noticed that they also include comments added by that person. In addition, they contain numerous violations of our transcription guidelines, for example blanks before commas and similar issues. So ground truth errors contribute to the CER.
