RFC: Remove the legacy OCR Engine #707
Comments
Is the intention that the legacy OCR engine will be available in the 3.0x branch and the LSTM engine in the 4.0 version? |
I support @theraysmith in removing the legacy OCR engine, as we are getting better results with the LSTM-based recognition. |
My personal opinion is that we should drop the old engine. It will be much easier to maintain and support Tesseract in this form. I also support dropping the OpenCL code. |
I also think we should release a final 3.0x version in the upcoming 2-6 weeks. |
|
I cannot agree with removing the old OCR engine until the new LSTM engine supports vertical text. Of course I know that the new LSTM engine is very good (especially for Japanese text that includes English words). In the meantime, maintaining the old engine provides the option of using the old OCR engine only for vertical text. c.f. #627, #641 |
It will support vertical text. I have an experimental implementation that treats it as an additional language, but it would be possible to make it depend on the layout analysis instead.
--
Ray.
|
If 3.05 is to be the last version with the legacy OCR engine (old engine), then there should be a possibility to read the OCR result from memory. It would also be great if the 3.05 and 4.0 versions could be installed at the same time (AFAIK there is a conflict with the tessdata filenames: they are the same, but the files are not compatible). |
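For plain text, the command line can already avoid the intermediate output file by using stdout as the output base name. This is only a CLI workaround, not the in-memory API call requested here; a minimal sketch with a placeholder image name:

```
# Print the recognized text to standard output instead of writing an output file
tesseract page.png stdout -l eng
```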
👍 for a side-by-side 3.05 and 4.00. A possible way to achieve this goal: |
I would prefer to be as consistent as possible: e.g. if 3.02 and 3.04 use tessdata, 3.05 should too. So 4.0 should be the version that starts with a change... |
Yes, if we later have 5.0 with different data files, it will use tesseract5 and this won't break anything. If we use plain tesseract for 4.0, it would have to be renamed to tesseract4 again once 5.0 takes the tesseract name - that's not good. |
I agree with zdenop. tessdata should be used for the 3.0x series, so as not to break any existing use. New naming can be used for LSTM 4.0.
ShreeDevi
|
A simple solution could be using tessdata/4, tessdata/5 and so on for new major versions, so we continue using a tessdata directory at the same location as before but automatically add the major version as the name of a subdirectory. If Tesseract uses semantic versioning in the future, I see no need to add a second number (although that would be possible, resulting in tessdata/4.0). For the program names, we can look for existing examples. I just checked my /usr/bin/*[0-9] files and found names like clang-3.8, gcc-6, php5, php-7.0, ruby2.1. So there is no clear convention whether to separate name and version by a dash or not, and whether to use the major version only or both major and minor version. With semantic versioning the major version should be sufficient. |
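As a concrete illustration of the proposal (a sketch only; the paths are hypothetical, not an agreed layout), each major version would get its own subdirectory, and the data path can already be chosen per invocation with --tessdata-dir or the TESSDATA_PREFIX environment variable:

```
# Hypothetical side-by-side layout:
#   /usr/share/tesseract-ocr/tessdata     <- 3.05 data files
#   /usr/share/tesseract-ocr/tessdata/4   <- 4.0 (LSTM) data files
tesseract page.png out305 --tessdata-dir /usr/share/tesseract-ocr/tessdata -l eng
tesseract page.png out400 --tessdata-dir /usr/share/tesseract-ocr/tessdata/4 -l eng --oem 1
```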
I'm thinking of using the same traineddata file format for 4.0, but adding some new subfiles, including a version string, as has been requested. The LSTM-only engine would then store the unicharset, recoder and dawgs as separate traineddata components, also satisfying the need to get at the unicharset. With an additional subfile to store the trainer-specific data, it should be possible to use the traineddata file format as a checkpoint format during training, which gets rid of a layer of complexity. I had thought of going with a different filename extension, but the versioned subdir seems like a good idea too. In any case, we should roll back the existing traineddata files for 3.05.
--
Ray.
|
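For the subfiles Ray mentions, the existing combine_tessdata tool already lists and extracts the components of a traineddata file; a minimal sketch, assuming eng.traineddata is in the current directory (the exact component set in 4.0 may differ):

```
# Show the directory of components stored in a traineddata file
combine_tessdata -d eng.traineddata
# Unpack all components (unicharset, dawgs, ...) into separate eng.* files
combine_tessdata -u eng.traineddata eng.
```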
https://wiki.ubuntu.com/ZestyZapus/ReleaseSchedule Feb 16 is the final deadline for changes to Ubuntu 17.04. I am not comfortable shipping anything from 4.x to these users, but we can consider taking a snapshot of the 3.0.5 branch. It does have some bug and compatibility fixes that are good for users. Regarding training data, I would not ship an update for that at all. This would purely be a code update. I know the long-standing issue has been restoring an API call (last seen in version 3.0.2) to send results to memory instead of to a file. I respect that idea, but we don't have it, and it's not that easy to add. I think it is fair to say that it would be impossible before the deadline. So the question is, do we ship an update to users this cycle or not? And if so, should I take a snapshot? And if so, what would it be called? A few more thoughts that are somewhat related
|
@jbreiden Good idea to do a code update for 3.05 for Ubuntu 17.04. There are a number of bug fixes and changes and it would be good to get them out to the users. Thanks! |
@theraysmith: Here I try to provide examples of where you get better results with the old engine. I did a lot of LSTM training with OCRopus on real images of historical printings and noticed that LSTM recognition was inferior to classic Tesseract in these cases:
My explanation is that single letters get decoded using the combined evidence of the whole line. If this evidence is either rare and unusual (1, 2) or mostly absent (3), decoding is uncertain, no matter how clearly single glyphs are printed and preserved (and therefore easily recognized by methods based on feature detection). So I tried both the old (OEM = 0) and new (OEM = 1) recognizer on these 10 lines (the last line is a regular text line for comparison, from a 1543 printing where a trained model yields 99.5% accuracy for the book).
Old method: tesseract -l lat --oem 0 --psm 7:
New method: tesseract -l lat --oem 1 --psm 7:
Admittedly, although this is all Latin text, the recognition looks much better without any language model (tesseract --oem 1 --psm 7):
But it still is less consistent than the old method in treating spacings. The last line shows the potential that may be reached once training on real images becomes available (long ſ, proper inter-word spacing model, historical glyphs). So I vote for keeping the old code just for these edge cases, which are otherwise hard to recognize at the same level of consistency. |
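For anyone who wants to reproduce such a comparison, a minimal sketch of the two runs over a set of single-line images (file names are placeholders; it assumes a 4.0 build with both engines and the lat traineddata installed):

```
# Legacy engine (--oem 0) vs. LSTM engine (--oem 1) on single text lines (--psm 7)
for img in line_*.png; do
    tesseract "$img" "${img%.png}.old" -l lat --oem 0 --psm 7
    tesseract "$img" "${img%.png}.new" -l lat --oem 1 --psm 7
done
```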
Without explicit
is equivalent to:
|
@theraysmith commented in commit b453f74
|
I think that there is reason to keep the old OCR engine while the LSTM engine is not yet ideal. |
The problem is that the code for the old engine is too large and complex. As Ray indicated, keeping it will make improving the new LSTM engine much harder. |
So there should be changes in the 4.0 code so that Tesseract 4.x and 3.05.x can be installed at the same time. |
Yes, whoever wants the old engine can use the 3.x series once LSTM is available in 4. |
Single letters are recognized better with the legacy engine. |
From #744
|
I can confirm all problems reported above by @uvius. In addition, some training files currently only exist for 3.x (notably deu_frak) or have bad quality (deu), so 4.0 does not improve the results for those languages. I also had an example where a larger part of a page was missing in the LSTM output while the old recognizer got most of that part right, but I am still trying to find that example again. |
As you know, unlike almost all the other files in the tessdata repo, the '_frak' traineddata files are not based on Google manpower (& machine-power) efforts. Maybe you and your friends from @UB-Mannheim can prepare a new deu_frak traineddata for Tesseract 4.00 and share it under an open source license (preferably Apache 2 or another permissive software license)? |
Font size estimation is already supported for the lstm engine. |
@amitdo er, should have excluded size from there, I mean style. |
Did you notice my other comment above (a link to Ray's comment)? |
So it seems that the legacy engine will stay in the final 4.0.0. I still have a question: has someone done serious testing on hundreds of pages to compare the results of:
- Just the text renderer with default PSM (auto, no OSD). I'm interested in character and word error rate (CER & WER) statistics; one way to compute them is sketched below. |
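No specific evaluation tool is named in this thread; one common choice is the ISRI analytic tools (the UTF-8 capable ocreval fork), which report character and word accuracy per page. A hedged sketch, assuming ocreval is installed and ground truth files named *.gt.txt sit next to the Tesseract output:

```
# Character accuracy (1 - CER) and word accuracy (1 - WER) with the ISRI tools
for gt in pages/*.gt.txt; do
    base="${gt%.gt.txt}"
    accuracy "$gt" "$base.txt" "$base.char.report"
    wordacc  "$gt" "$base.txt" "$base.word.report"
done
```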
I am currently running accuracy tests on 189 pages from our historic books. So far I have results from ABBYY FineReader and from Tesseract with the fast Fraktur model and PSM 1, 3 and 6. I also tested ScanTailor + Tesseract with fast Fraktur and PSM 1. The run with the best Fraktur model is still in progress. First results for the CER median: ABBYY: 10.5 % Detailed results will be published as soon as the tests are finished. |
lstm+legacy is currently not usable for mass production, because chances are high that Tesseract will fail because of the well-known assertion failure. |
Thanks for sharing! What preprocessing options are used with ScanTailor? I think ABBYY also has 'fast' and 'accurate' modes. |
ScanTailor was used for the preprocessing; ABBYY used the default mode with different language settings. I still have to look for effects caused by different handling of diacritics (for example, Latin ground truth and ABBYY result without accents, but the original text uses accents => Tesseract Fraktur detects accents). The raw data is at https://digi.bib.uni-mannheim.de/~stweil/anciendroit/new/. |
I decided to close this issue. There is now an option to compile Tesseract 4.0.0 without the legacy engine code. |
It's more than 64k LOC now. |
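For reference, this is roughly what such a build looks like with the autotools setup (a sketch; check ./configure --help in your checkout for the exact option name):

```
# Build Tesseract from a git checkout without the legacy (non-LSTM) engine code
./autogen.sh
./configure --disable-legacy
make -j$(nproc)
```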
Can you run the accuracy tests again on the same dataset with master and/or latest tagged version, to make sure there is no regression? |
Here are the results with the latest Tesseract for the line images posted by @uvius. They are still not perfect, but much better than the old ones, especially with a new model which I have recently trained (based on ground truth published by @uvius and others).
|
tesseract-5.0.0-alpha-592-gb221f produces only a few different results on the 189 files of my test set, but some of those are significantly better (-=old, +=new):
Git master produces different results, some of them slightly worse, some better. The most significant change with the latest Tesseract is the time required to process the 189 pages: it dropped from 1638 s to 926 s. I think that both effects are caused by commits cfb1fb2 and eaf72ac. |
With our latest model file and current Tesseract the median CER is reduced to 9 %. The execution time is now 635 s, much more than twice as fast as the old 4.0 results, and with much better accuracy. The bad news is that the latest Tesseract does not detect any text on two of the 189 pages. That requires a closer examination.
|
What do these pages look like? I guess 470875348_0010.txt is a title page, as page 0012 is the preface. Then it's the "title page problem": maybe large letters in special designs, extreme letterspacing, etc. It gets better if I cut title pages into line images and the font style is similar to something trained on. Title pages seem to be underrepresented in training sets; it's a sort of selection bias. |
See original JPEG files 470875348_0010 and 452117542_0250. |
Tried it with
and get for
and for
Page 470875348_0010.jpg also looks nice, but I didn't spend time on a GT.txt. A preprocessing issue? The remaining CER of ~4 % is still high given the good image quality and a very common typeface (Garamond-like). |
The regression (empty page) was introduced by commit 5db92b2, especially by the modified declaration there. Our GT data is available online. Tesseract output includes long s (ſ) and some other characters which must be normalized before the comparison with that ground truth. |
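A minimal sketch of the kind of normalization meant here, applied to the OCR output before computing the CER (the substitution shown is an example; further characters would be mapped the same way):

```
# Map long s (ſ) to the form used in the ground truth; add further mappings as needed
sed -e 's/ſ/s/g' ocr_output.txt > ocr_output.normalized.txt
```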
That's not surprising, because our training data had no focus on such typefaces. It used GT4HistOCR (a mix from early prints up to the 19th century), Austrian newspapers (mainly Fraktur) and German primers from the 19th century (Fraktur and handwritten script). Another and maybe even more dominant reason is the current quality of the ground truth texts. They were produced by a single person (no 2nd proofreader), and I just noticed that they also include comments added by that person. In addition they include numerous violations of our transcription guidelines, for example blanks before commas and similar issues. So ground truth errors contribute to the CER. |
Ray wants to get rid of the legacy OCR engine, so that the final 4.00 version will only have one OCR engine based on LSTM.
From #518:
@stweil commented:
@theraysmith commented: