German - Characters added to result multiple times (aä / AÄ) #1060

Open
TheSeiko opened this issue Aug 1, 2017 · 41 comments

Comments

@TheSeiko

TheSeiko commented Aug 1, 2017

tesseract 4.00.00alpha
leptonica-1.74.1
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
Win10 64bit - built Uni Mannheim

deu.traineddata - Repeating of characters:

Current Behavior:

ÄGYPTEN -> ÄAGYPTEN
Grand-Prix -> Gräand-Prix
AUSTRALIEN -> AUSTRAÄLIEN
GROSSBRITANNIEN -> GROSSBRITAÄANNIEN

Expected Behavior:

ÄGYPTEN -> ÄGYPTEN
Grand-Prix -> Grand-Prix
AUSTRALIEN -> AUSTRALIEN
GROSSBRITANNIEN -> GROSSBRITANNIEN

Suggested Fix:
1 blob / 1 box should only be 1 outcome / 1 result

Additional Info:
Example images are available for posting
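The duplications listed above share a pattern: a letter sits directly next to a variant of the same letter that differs only by a diacritic (ÄA, äa, AÄ). As a stopgap until the model is fixed, such pairs can at least be flagged in the OCR output. The following is a minimal Python sketch; the function names are made up for illustration, and it is only a heuristic, since it cannot tell which of the two letters is the correct one.

```python
import unicodedata


def base_letter(ch):
    """Return the letter without its diacritics (first code point of the NFD form)."""
    return unicodedata.normalize("NFD", ch)[0]


def suspicious_pairs(word):
    """List adjacent letter pairs that differ only by a diacritic, e.g. 'ÄA' or 'aä'."""
    # Work on precomposed characters so that 'ä' is a single code point.
    word = unicodedata.normalize("NFC", word)
    pairs = []
    for left, right in zip(word, word[1:]):
        if (
            left != right
            and left.isalpha()
            and right.isalpha()
            and base_letter(left) == base_letter(right)
        ):
            pairs.append(left + right)
    return pairs


for sample in ["ÄAGYPTEN", "Gräand-Prix", "AUSTRAÄLIEN", "ÄGYPTEN"]:
    print(sample, suspicious_pairs(sample))
```

Case-only duplications such as VwW-Werk are not covered by this check.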

@TheSeiko
Author

TheSeiko commented Aug 1, 2017

Additional example (wW)
VW-Werk -> VwW-Werk

@Shreeshrii
Collaborator

Shreeshrii commented Aug 1, 2017 via email

@amitdo
Collaborator

amitdo commented Aug 1, 2017

Suggested Fix:
1 blob / 1 box should only be 1 outcome / 1 result

  1. It won't work with ligatures.
  2. With the legacy OCR engine, there is a character segmentation step, and the OCR is done on individual character blobs.
    With the new LSTM engine, the OCR is done by the neural network on a sequence of pixels in text lines, not on pre-segmented blobs.

@amitdo
Collaborator

amitdo commented Aug 1, 2017

The fix for most problems with the LSTM engine is more / better training.

DAS2016 Slides, 6. Modernization Efforts, page 17
Encyclopedia -> EE-n-c-yy-c-l-o-p-e-d-i-a -> Encyclopedia

I think that for 'in dictionary' words this kind of duplication would be eliminated.
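To make the slide example concrete, here is a small Python sketch of the collapsing step it appears to show: repeated labels are merged and the separator is dropped. Reading '-' as a blank symbol is my interpretation of the slide notation, and this is not Tesseract's actual decoder, which does more than a plain collapse.

```python
def collapse(frames, blank="-"):
    """Merge runs of identical labels, then drop the blank symbol."""
    out = []
    prev = None
    for sym in frames:
        # Emit a label only when it starts a new run and is not the blank.
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)


print(collapse("EE-n-c-yy-c-l-o-p-e-d-i-a"))

# Hypothetical frame sequence: two different labels emitted for one glyph.
print(collapse("G-r-ä-a-n-d"))
```

The second call illustrates why collapsing alone cannot remove an "äa" pair: the two labels differ, so they are not treated as a repeat. Something else (a dictionary, or better training) has to reject the extra label.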

@amitdo
Collaborator

amitdo commented Aug 1, 2017

Similar issues: #884 #1011

@TheSeiko
Author

TheSeiko commented Aug 2, 2017

@Shreeshrii

https://github.com/tesseract-ocr/tessdata/tree/master/best
is not working at all

deu.traineddata 19.721 KB
best - deu.traineddata 8.427 KB

The best traineddata only delivers empty results.

@Shreeshrii
Collaborator

Shreeshrii commented Aug 2, 2017

Are you using --oem 1?

You can see the contents of the traineddata with

combine_tessdata -u deu.traineddata

These are probably LSTM-only models and do not include the legacy engine data, which is used via --oem 0.

@Shreeshrii
Collaborator

@stweil Have you tested the deu model?

@TheSeiko
Author

TheSeiko commented Aug 2, 2017

Yes, I'm using --oem 1

I'm just swapping deu.traineddata in tessdata.
The old one works without problems; the new one gives empty output.
I checked the download: the downloaded file has the same size as in the repository.

@Shreeshrii
Collaborator

@stweil - you may need to update the windows binaries on Uni Mannheim site with the latest updates from Ray.

@TheSeiko I haven't personally tested the deu model. Will check and post the result. Wondering whether your Windows binary is old...

@Shreeshrii
Collaborator

Looks like you need both deu and frk models

wget -O ./tess4data-save/deubest.traineddata https://github.com/tesseract-ocr/tessdata/blob/master/best/deu.traineddata?raw=true

sudo cp ./tess4data-save/*.traineddata /usr/share/tesseract-ocr/4.00/tessdata

time tesseract ./tif/phototest.tif stdout --oem 1 -l deu
time tesseract ./tif/phototest.tif stdout --oem 1 -l deubest

Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
real	0m1.633s
user	0m2.032s
sys	0m0.492s
Error opening data file /usr/share/tesseract-ocr/4.00/frk.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'frk'
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
Jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
real	0m2.045s
user	0m2.744s
sys	0m0.612s

@Shreeshrii
Collaborator

Works on Linux. It looks for the frk traineddata, probably listed in deu.config.

@stweil
Contributor

stweil commented Aug 2, 2017

Have you tested the deu model?

The new best one? No, I have not tested it yet. I am currently focused on Fraktur where the new results clearly beat the old ones.

[...] update the windows binaries on Uni Mannheim site [...]

I noticed on Linux that "old" Tesseract executables crash with the new traineddata, so I expect that my current Windows binaries would crash, too. Building new ones is on my list.

@TheSeiko
Author

TheSeiko commented Aug 2, 2017

thank you

@stweil
Contributor

stweil commented Aug 4, 2017

[...] update the windows binaries on Uni Mannheim site [...]

The new binaries are now available.
I now use semantic versioning, so this is my 4.0.0-alpha.20170804.

@Shreeshrii
Collaborator

Thanks!

I now use semantic versioning, so this is my 4.0.0-alpha.20170804.

:-)

@TheSeiko
Author

TheSeiko commented Aug 4, 2017

Thank you for the new binaries.

There are still similar errors:

hitzefrei -> 1 x hitzefreii / 1 x hitzefreil

Suggestion: The results are a lot better with the 4.0 LSTM engine than with 3.05.01, but training seems to be difficult. Maybe it would be a good idea to offer a web page where people could upload example image files and matching text files to include them in the training process.

@stweil
Contributor

stweil commented Aug 4, 2017

looks for frk traineddata, probably listed in deu.config

@theraysmith, best/deu.traineddata includes a deu.config with tessedit_load_sublangs frk. Why was this dependency added? It is confusing for end users who want to use -l deu that they need frk.traineddata, too.

@TheSeiko, maybe you'd get better results for Antiqua text without that frk dependency (which might be good for texts which also include Fraktur). You can use combine_tessdata to extract the components of best/deu.traineddata, remove deu.config and combine the remaining components again in a new file.
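The extract / remove / recombine steps could be scripted roughly as follows. This is a sketch under assumptions: it relies on `combine_tessdata -u <file> <prefix>` for unpacking and `combine_tessdata <prefix>` for repacking, assumes the extracted config component is written as `<prefix>config`, and the helper names and working directory are made up for illustration.

```python
import os
import subprocess


def plan_strip_config(traineddata, workdir):
    """Build the unpack/repack commands and the path of the config component."""
    lang = os.path.basename(traineddata).split(".")[0]
    # combine_tessdata expects a path prefix ending in a dot, e.g. "tmp/deu."
    prefix = os.path.join(workdir, lang + ".")
    unpack = ["combine_tessdata", "-u", traineddata, prefix]
    repack = ["combine_tessdata", prefix]
    return unpack, prefix + "config", repack


def strip_config(traineddata, workdir):
    """Unpack, delete the config component if present, and repack."""
    unpack, config_path, repack = plan_strip_config(traineddata, workdir)
    os.makedirs(workdir, exist_ok=True)
    subprocess.run(unpack, check=True)
    if os.path.exists(config_path):
        os.remove(config_path)
    subprocess.run(repack, check=True)
    # The repacked file is written next to the components.
    return config_path[: -len("config")] + "traineddata"


# Only prints the plan; call strip_config() to actually run the tool.
for step in plan_strip_config("deu.traineddata", "tmp"):
    print(step)
```

Keeping the command construction separate from execution makes it easy to inspect what would run before touching any files.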

@TheSeiko
Author

TheSeiko commented Aug 4, 2017

Thank you for the tip. Much appreciated!

@TheSeiko
Author

TheSeiko commented Aug 7, 2017

@stweil Am I doing something wrong?

There is only a version file in deu.traineddata when I use the binaries from 04.08 (4 August):

E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu.
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

E:\Tesseract-OCR4.0a2>combine_tessdata -d deu.traineddata
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

@theraysmith
Contributor

theraysmith commented Aug 7, 2017 via email

@Shreeshrii
Collaborator

Shreeshrii commented Aug 7, 2017 via email

@theraysmith
Contributor

theraysmith commented Aug 7, 2017 via email

@Shreeshrii
Collaborator

It seems the majority of the problems are lack of sync of code/data. There
are dependencies between code and data that have changed due to moving the
unicharset from the LSTM model to the traineddata file.

Yes. That is the problem.

One possible solution, which I have been requesting for a while, is tagging "important" commits. Then it would be easy to say: use tesseract, tessdata and langdata as of 4.0.0alpha-20170807.

@TheSeiko
Author

TheSeiko commented Aug 8, 2017

@stweil thank you, removing deu.config helped a lot


Regarding the best deu traineddata without deu.config:

After ~50k test images: great recognition rate.

The only problem so far: sometimes i is not recognised properly (OCR result - text in image):

sıch - sich
Parıs - Paris

I'm adding a regex to replace ı with i
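Such a replacement could look like the following in Python. This is only a sketch of the idea, not the code actually used here; the function name is illustrative. It is safe for German text only, because the dotless ı (U+0131) is a legitimate letter in Turkish and a few other languages.

```python
import re

# U+0131 LATIN SMALL LETTER DOTLESS I, which German text should never contain.
DOTLESS_I = re.compile("\u0131")


def fix_dotless_i(text):
    """Replace every dotless i with a regular i."""
    return DOTLESS_I.sub("i", text)


print(fix_dotless_i("s\u0131ch"))
print(fix_dotless_i("Par\u0131s"))
```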

@TheSeiko
Author

TheSeiko commented Aug 8, 2017

and j -> J

OCR Result <-> Text in image
Jungen - jungen
Jury - jury
Juries - juries

@TheSeiko
Author

TheSeiko commented Aug 9, 2017

$$-Jährige <-> $$-jährige
SPO - SPÖ

@theraysmith
Contributor

theraysmith commented Aug 10, 2017 via email

@TheSeiko
Author

TheSeiko commented Aug 11, 2017

@theraysmith jährige/Jährige can be both - a noun (capital letter) or an adjective (lowercase):
a 42 year old man - ein 42-jähriger Mann
a 42 year old - ein 42-Jähriger

The Latin model handles this problem better; I ran it yesterday on ~100k frames.
However, Latin has some problems with umlauts (mutated vowels).
For example:
+------------+---------+-------------------+-------------------+---------------------+------------------+
| languageId | ranking | replaceTo         | replaceRegex      | inputDate           | model            |
+------------+---------+-------------------+-------------------+---------------------+------------------+
| 10         | 10      | Österreich        | Osterreich        | 2017-08-03 14:45:05 | DEU without FRAK |
| 10         | 10      | Paris             | Parıs             | 2017-08-08 10:52:04 | DEU without FRAK |
| 10         | 10      | i                 | ı                 | 2017-08-08 13:04:50 | DEU without FRAK |
| 10         | 10      | ÖFB-Goalie        | OFB-Goalie        | 2017-08-08 14:20:24 | DEU without FRAK |
| 10         | 10      | Volkspartei       | Volksparte!l      | 2017-08-09 09:14:11 | LATIN            |
| 10         | 10      | Eurofighter-Übung | Eurofighter-Ubung | 2017-08-09 09:34:09 | LATIN            |
| 10         | 10      | Überlebende       | Uberlebende       | 2017-08-10 08:08:04 | LATIN            |
| 10         | 10      | Eine              | Fine              | 2017-08-10 09:04:31 | LATIN            |
| 10         | 10      | Oberwölz          | Oberwòölz         | 2017-08-10 10:31:30 | LATIN            |
| 10         | 10      | Wörter            | Wõōrter           | 2017-08-10 14:25:23 | LATIN            |
| 10         | 10      | Wörter            | Wōörter           | 2017-08-10 14:25:51 | LATIN            |
| 10         | 10      | Wörter            | Wōrter            | 2017-08-10 14:26:45 | LATIN            |
| 10         | 10      | Männer            | Māänner           | 2017-08-10 15:04:25 | LATIN            |
+------------+---------+-------------------+-------------------+---------------------+------------------+
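A table like this can be applied to OCR output as a post-processing step. The Python sketch below uses a few rows from the table; treating each replaceRegex entry as a literal whole word is an assumption made for the example, and the names CORRECTIONS and apply_corrections are illustrative, not taken from the real setup. Single-character rules such as ı -> i need a plain substitution instead and are left out here.

```python
import re

# (OCR output, intended text) pairs copied from the table above.
CORRECTIONS = [
    ("Osterreich", "Österreich"),
    ("OFB-Goalie", "ÖFB-Goalie"),
    ("Uberlebende", "Überlebende"),
    ("Wōörter", "Wörter"),
    ("Māänner", "Männer"),
]


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace each known misreading where it occurs as a whole word."""
    for wrong, right in corrections:
        # Lookarounds keep the match from firing inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(wrong) + r"(?!\w)")
        text = pattern.sub(lambda _match: right, text)
    return text


print(apply_corrections("Jeder dritte Unfall in Osterreich"))
```

Entries such as Fine -> Eine are deliberately omitted from the sketch, since "Fine" can be a correct word in mixed-language text.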

I've collected some example images and I'll try to do the "Fine Tuning Training"

@TheSeiko
Author

| 10 | 10 | Arzl-Ost | Arzl-0Ost | 2017-08-11 09:34:41 | - LATIN
| 10 | 10 | Ein | Fin | 2017-08-11 09:35:26 | - LATIN
| 10 | 10 | Während | Wāährend | 2017-08-11 09:37:20 | - LATIN
| 10 | 10 | Oscarprämierter | Oscarprāmierter | 2017-08-11 10:02:07 | - LATIN

@TheSeiko
Author

| 10 | 1502045216726 | Oberwölz | Oberwõlz
| 10 | 1502047625611 | Militärbasis | Militārbasis
| 10 | 1502057831054 | www.uncut.at | WWWw.uncut.at
| 10 | 1502099066258 | Wörter | Wõörter
| 10 | 1502269194006 | Donaupark zum Gratis | Donaupark zum 6Gratis

@stweil
Contributor

stweil commented Nov 9, 2020

@TheSeiko, do you have example images which still show this issue? We need them to test a bug fix which was suggested in #3144.

@TheSeiko
Author

@stweil
I'll post some example images asap. It got a lot better but still happens.

@TheSeiko
Author

One thing I've found is that sometimes the dots of an umlaut (e.g. on Ö) get attached to the previous line, so the Ö is recognised as an O and the previous line gets extra dots. Sometimes the reason is that the previous line has a different character size than the following paragraph. But this is only one case.

@amitdo
Collaborator

amitdo commented Nov 15, 2020

One thing I've found is that sometimes the dots of an umlaut (e.g. on Ö) get attached to the previous line, so the Ö is recognised as an O and the previous line gets extra dots. Sometimes the reason is that the previous line has a different character size than the following paragraph. But this is only one case.

This looks like a different issue from the original one.

@TheSeiko
Author

C:\Tesseract-OCR20200328>tesseract --version
tesseract v5.0.0-alpha.20200328
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0


C:\Tesseract-OCR20190314>tesseract --version
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE

@TheSeiko
Author

pathTesseract: C://Tesseract-OCR20190314/tesseract
FrameType: CENTER_SMALL_COLOR
ForDeleting: false
FrameColor: BLUE_023_060_211

Österreich
Jeder dritte Fuf&gaánger-Unfall in Osterreich

passiert auf den Schutzwegen.

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: CENTER_SMALL_COLOR
ForDeleting: false
FrameColor: BLUE_023_060_211

Österreich
Jeder dritte Fußgänger-Unfall in Osterreich

passiert auf den Schutzwegen.

20201125160529375_1598603314737_bottom

@TheSeiko
Author

@stweil - grófste

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: RIGHT_0970_COLOR
ForDeleting: false
FrameColor: CYAN_008_107_102

Fjorde

Der Kangertittivaq ist
das grófste Fjord-
system der Welt.

Der längste Fjord
erstreckt sich über
fast 350 Kilometer an
Grönlands Ostküste.

20201125170041216_1577519184094_main

@TheSeiko
Author

@stweil ÓFB-Legionaàr

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: CENTER_WHITE
ForDeleting: false
FrameColor: WHITE

Premier League
Die ,Reds" haben weiter eine makellose
Bilanz und sind klarer Tabellenführer.

Bei Watford feiert ÓFB-Legionaàr Pródl beim
0:0 gegen Sheffield den ersten Liga-Einsatz
seit einem Jahr. Watford bleibt weiter

20201125180010431_1570300709269_main

@TheSeiko
Author

@stweil A^4

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: CENTER_SMALL_COLOR
ForDeleting: false
FrameColor: BLUE_023_060_211

Ungarn/Üsterreich
Lebenslang für die vier Hauptangeklagten

nach dem A^4-Flüchtlingsdrama.

20201125185627625_1561026929372_bottom

@TheSeiko
Author

@stweil 4 comes from nowhere

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: LEFT_1080_COLOR
ForDeleting: false
FrameColor: RED_215_023_020

Kinoabende
Architektur und
Urbanismus in Zeiten
des Klimawandels.

4

„Die Zukunft reparieren‘
ist das Thema dieses
Architekturfilmfestivals.

20201125190031442_1566254197158_main
