How can I improve ocropus accuracy? #296

IgorMunizS · 2018-02-21T12:51:26Z

Hi,
I'm facing 2 problems:

1 - I need to use Ocropy to extract text from documents in Portuguese.
So far, I have added the required characters to char.py, and I am training the network, starting from a previously trained model, following this guide: https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth.

2 - I know about document quality restrictions (300 dpi), but some images that I have are bad scans. I've tried the same images in other APIs (like Google Vision) and got better results, but I liked ocropy.
I'm wondering if there are any preprocessing techniques that can improve the results.

So, what can I do? What is the best way to generate training data for the ocropy network?
Edit: Does ocropy training support multithreading?
Thanks!

Python version:
Python 2.7.14 :: Anaconda, Inc.
Git revision of ocropy:
commit e9b6121
Merge: 43381c4 289a58f
Author: Konstantin Baierer [email protected]
Date: Mon Feb 19 19:24:12 2018 +0100

Merge pull request #236 from lehzwo/master

ocropus-gpageseg: Enable usage of masks to specify column separators/ ignore areas of scan
Operating System and version:
Linux ubuntu-virtual 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


zuphilip · 2018-02-21T22:57:39Z

What do your confusions currently look like, i.e. the output of ocropus-econf? In general, it is hard to say what will improve the accuracy. Can you share 2-3 of your documents here?

IgorMunizS · 2018-02-22T13:00:07Z

The confusion in test data is:
errors 233
missing 0
total 4894
err 4.761 %
errnomiss 4.761 %
28 ÇÆ çã
15 8 S
14 Ä á
13 Æ ã
12 Ë í
11 Ï ó
7 È é
7 0A ÇÃ
5 ÇÔ çõ
5 , .
0.0476093175317
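
For reference, the `err` figure in the summary above is just `errors` divided by `total`. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and are not taken from ocropus-econf itself:

```python
# Counts copied from the ocropus-econf summary above.
errors = 233
total = 4894

# float() keeps the division exact on Python 2.7 as well as Python 3.
err = errors / float(total)

print("err %.3f %%" % (100 * err))
```

The result agrees with the last line of the report (0.0476093175317), so the tool is reporting a plain character error rate over the test lines.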

I left my model training all night on Portuguese texts and images generated by ocropus-linegen. The training error is decreasing, but the test error is worse than that of the default model (English version).
Last 4 test errors:
0.04298535663675012
0.050070854983467174
0.0547945205479452
0.05550307038261691
It's currently at 19000 iterations.
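
A test error that keeps rising while the training error falls is commonly read as a sign of overfitting, and a usual response is to keep the saved model with the lowest test error rather than the latest one. The sketch below only illustrates that selection on the four values listed above. The iteration number of each checkpoint is not given here, so it works with list positions, and the variable names are hypothetical:

```python
# The last four test errors reported above, oldest first.
test_errors = [
    0.04298535663675012,
    0.050070854983467174,
    0.0547945205479452,
    0.05550307038261691,
]

# Position of the checkpoint with the lowest test error.
best_index = min(range(len(test_errors)), key=lambda i: test_errors[i])

# True if every evaluation is worse than the one before it.
rising = all(a < b for a, b in zip(test_errors, test_errors[1:]))

print("best checkpoint position: %d" % best_index)
print("test error strictly rising: %s" % rising)
```

On these numbers the earliest of the four checkpoints is the best one, which suggests comparing saved models by test error instead of assuming that more iterations always help.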
I'll see if I can share some files and come back here.
Thanks for your reply!

Edit:
Files:

These are good files. For now, I'm trying to get better results on Portuguese characters and am not worrying about scan quality.

zuphilip added the ❔ question label Feb 21, 2018

ricardobnjunior mentioned this issue Aug 28, 2018

Doubt about OCRopus IgorMunizS/PedestrianDetection#1

Closed
