How can I improve ocropus accuracy? #296

Open
IgorMunizS opened this issue Feb 21, 2018 · 2 comments

Comments

IgorMunizS commented Feb 21, 2018

Hi,
I'm facing 2 problems:

1 - I need to use Ocropy to extract text from documents in Portuguese.
So far, I have added the required characters to char.py and am training the network, starting from a previously trained model, following this guide: https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth.

2 - I know about the document quality requirements (300 dpi), but some of my images are bad scans. I've tried the same images with other APIs (like Google Vision) and got better results, but I like ocropy.
I'm wondering if there are any preprocessing techniques that can improve the results.

So, what can I do? What is the best way to generate data for training the ocropy network?
Edit: Does ocropy training support multithreading?
Thanks!

@zuphilip
Collaborator

What do your confusions currently look like, i.e. the output of ocropus-econf? In general, it is hard to say what will improve the accuracy. Can you share 2-3 of your documents here?


IgorMunizS commented Feb 22, 2018

The confusions on the test data are:
errors 233
missing 0
total 4894
err 4.761 %
errnomiss 4.761 %
28 ÇÆ çã
15 8 S
14 Ä á
13 Æ ã
12 Ë í
11 Ï ó
7 È é
7 0A ÇÃ
5 ÇÔ çõ
5 , .
0.0476093175317
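The summary figures above can be cross-checked with a few lines of Python. This is only a sketch: the numbers and confusion pairs are copied from the output above, and the rule used to call a pair "accent-related" (either side contains a non-ASCII character) is an assumption made for illustration, not something ocropus-econf reports.

```python
# Figures copied from the ocropus-econf summary above.
errors = 233
total = 4894

# Listed confusions as (count, first column, second column), copied verbatim.
confusions = [
    (28, "ÇÆ", "çã"),
    (15, "8", "S"),
    (14, "Ä", "á"),
    (13, "Æ", "ã"),
    (12, "Ë", "í"),
    (11, "Ï", "ó"),
    (7, "È", "é"),
    (7, "0A", "ÇÃ"),
    (5, "ÇÔ", "çõ"),
    (5, ",", "."),
]

# Error rate as a fraction and as a percentage.
err_rate = errors / total

# Total occurrences over the listed confusion pairs.
listed = sum(count for count, _, _ in confusions)

# Assumption: a pair is "accent-related" if either side has a non-ASCII character.
accented = sum(count for count, a, b in confusions if not (a + b).isascii())

print(f"error rate: {err_rate:.6f} ({100 * err_rate:.3f} %)")
print(f"listed confusion occurrences: {listed}")
print(f"accent-related among listed: {accented} ({accented / listed:.1%})")
```

Under that assumption, most of the listed confusions involve accented characters, which is consistent with the focus on Portuguese characters in this thread.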

I left my model training all night with Portuguese texts and images generated by ocropus-linegen. The training error is decreasing, but the test error is worse than with the default model (en version).
Last 4 test errors:
0.04298535663675012
0.050070854983467174
0.0547945205479452
0.05550307038261691
It's currently at 19000 iterations.
I'll see if I can share some files and come back here.
Thanks for your reply!
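A training error that keeps falling while the test error rises, as in the four values listed above, is commonly a sign of overfitting. The following is a minimal sketch, assuming the values come from successive saved checkpoints in the order listed, of selecting the evaluation with the lowest test error instead of the latest one. The helper names are hypothetical and not part of ocropy, and the iteration number of each checkpoint is not given above, so only list positions are used.

```python
# Last four test errors, copied from the comment above (oldest first).
test_errors = [
    0.04298535663675012,
    0.050070854983467174,
    0.0547945205479452,
    0.05550307038261691,
]


def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


def best_checkpoint(values):
    """Return (position, error) of the evaluation with the lowest test error."""
    position = min(range(len(values)), key=values.__getitem__)
    return position, values[position]


rising = is_strictly_increasing(test_errors)
best_position, best_error = best_checkpoint(test_errors)

print("test error rising:", rising)
print(f"lowest test error: {best_error:.4f} at position {best_position}")
```

With these four values the earliest evaluation has the lowest test error, so the most recent checkpoint is not necessarily the one to keep.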

Edit:
Files:
output-0
output-1
These are good files. For now, I'm trying to get better results with Portuguese characters and am not worrying about scan quality.
