Detect font style attributes #305

Closed
alexgg94 opened this issue Oct 16, 2020 · 3 comments

Comments

@alexgg94

Hi,
I'm using the image_to_data method to extract text from images and organize it into a pandas DataFrame.
Is there any way to make image_to_data also report each word's font style attributes (bold, italic, font size, ...)?

Cheers

@bozhodimitrov
Collaborator

Hi @alexgg94
Can tesseract itself report this information through an additional option or configuration?
pytesseract is just a wrapper, so if tesseract doesn't support this itself, it will be a lot harder to implement additional logic to provide these attributes.

Another downside is that extracting additional information such as font attributes may require running tesseract multiple times, which could make image_to_data very slow.

@alexgg94
Author

Hi,

Tesseract itself can provide this information through the WordFontAttributes method defined in ResultIterator.
I agree that this functionality could cost extra time, but it would be great if it could somehow be used alongside image_to_data.

I'm just wondering whether pytesseract can offer this feature right now, or whether it has been considered.

P.S. image_to_data makes my life so much easier, that's why I'm asking 😄

Cheers

@bozhodimitrov
Collaborator

bozhodimitrov commented Oct 16, 2020

It seems that there is a problem with WordFontAttributes in the new engine: tesseract-ocr/tesseract#1074
In other words, this feature might be available only with the old (original) Tesseract engine (config option --oem 0).
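One possible route through pytesseract that avoids touching the iterator API is the hOCR output: with Tesseract's hocr_font_info variable enabled, word spans may carry x_font and x_fsize entries in their title attribute, and bold or italic words may be wrapped in strong or em tags. The sketch below assumes that markup shape and assumes the installed traineddata includes the legacy model that --oem 0 needs; parse_hocr_fonts and image_to_font_data are hypothetical helper names, not part of pytesseract.

```python
import re
from html.parser import HTMLParser


class HocrWordFonts(HTMLParser):
    """Collect text and font hints from ocrx_word spans in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            title = attrs.get("title") or ""
            # Assumed title layout: "bbox ...; x_wconf N; x_font NAME; x_fsize N"
            font = re.search(r"x_font ([^;]+)", title)
            size = re.search(r"x_fsize (\d+)", title)
            self._current = {
                "text": "",
                "font": font.group(1).strip() if font else None,
                "size": int(size.group(1)) if size else None,
                "bold": False,
                "italic": False,
            }
        elif self._current is not None and tag == "strong":
            self._current["bold"] = True
        elif self._current is not None and tag == "em":
            self._current["italic"] = True

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_fonts(hocr):
    """Return one dict per word found in an hOCR string."""
    parser = HocrWordFonts()
    parser.feed(hocr)
    parser.close()
    return parser.words


def image_to_font_data(image):
    """Hypothetical helper: run tesseract with the legacy engine, parse hOCR."""
    import pytesseract  # imported lazily so the parser works without it

    hocr = pytesseract.image_to_pdf_or_hocr(
        image, extension="hocr", config="--oem 0 -c hocr_font_info=1"
    )
    return parse_hocr_fonts(hocr.decode("utf-8"))
```

The returned list of dicts can be passed straight to pandas.DataFrame and joined with the image_to_data output by word order. Whether the font fields are actually populated depends on the engine and model in use, so the values should be checked on real scans before relying on them.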

You can also try the more advanced wrapper tesserocr, but I am not sure whether it supports this either.
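For anyone who wants to try the tesserocr route, here is a minimal sketch. It assumes a tesserocr version whose PyTessBaseAPI accepts an oem argument and whose word-level iterator exposes WordFontAttributes() returning a dict (or None when no font is reported), plus traineddata that contains the legacy model. word_font_rows and iter_word_fonts are hypothetical helper names.

```python
def word_font_rows(words):
    """Turn (text, attrs) pairs into flat dicts suitable for a DataFrame.

    attrs is expected to be the dict returned by WordFontAttributes(),
    or None when the engine reports no font for that word.
    """
    rows = []
    for text, attrs in words:
        attrs = attrs or {}
        rows.append({
            "text": text,
            "font_name": attrs.get("font_name"),
            "bold": attrs.get("bold"),
            "italic": attrs.get("italic"),
            "pointsize": attrs.get("pointsize"),
        })
    return rows


def iter_word_fonts(image_path, lang="eng"):
    """Yield (text, font attributes) per word using the legacy engine.

    Assumption: tesserocr is installed and the traineddata for `lang`
    includes the legacy model required by OEM.TESSERACT_ONLY.
    """
    from tesserocr import PyTessBaseAPI, OEM, RIL, iterate_level

    with PyTessBaseAPI(lang=lang, oem=OEM.TESSERACT_ONLY) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        iterator = api.GetIterator()
        for word in iterate_level(iterator, RIL.WORD):
            # GetUTF8Text may raise if the page contains no text at all.
            yield word.GetUTF8Text(RIL.WORD), word.WordFontAttributes()
```

Something like pandas.DataFrame(word_font_rows(iter_word_fonts("page.png"))) would then give one row per word, where "page.png" is a placeholder path. Given the linked tesseract issue, the attributes may be missing or unreliable with the LSTM engine, which is why the sketch pins the legacy one.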
