Creating ALTO [enhancement] #419

Closed
FoxKyong opened this issue Sep 6, 2016 · 7 comments
Labels
feature request output issues related output formats

Comments

FoxKyong commented Sep 6, 2016

Are there any plans to add ALTO support to Tesseract? I was thinking about writing a module for it. I searched for a conversion tool but found nothing that works. I need something that runs in a Linux terminal; the only tool I found converts hOCR to ALTO, and its output is wrong. It would also be better if Tesseract generated ALTO directly.

Contributor

stweil commented Sep 6, 2016

Author

FoxKyong commented Sep 6, 2016

Yes, I have, and the generated ALTO isn't valid. It can't be imported into the software I use, and the validator ocr-validate also reports it as invalid.

Contributor

stweil commented Sep 6, 2016

Could you please create an issue for that project then and add more details about the problems which you encountered?

Contributor

jbreiden commented Sep 6, 2016

A contribution of direct ALTO support would be welcome. I recommend implementing it in a separate file, api/altorenderer.cpp, rather than adding it to api/baseapi.cpp.

Contributor

stweil commented Nov 30, 2018

There is now initial support for ALTO output in the latest Git master. Nevertheless, I am keeping this issue open until altoxml/schema#54 is resolved and more testing with Tesseract + ALTO has been done.
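
For anyone who wants to help with that testing, here is a minimal sketch for inspecting an ALTO file. It assumes the file was produced with the `alto` config (for example `tesseract page.png page alto`, which writes `page.xml`). The function name `alto_words` and the sample document are hypothetical and only illustrate the standard ALTO `String` attributes (`CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`); they are not taken from Tesseract's code or from real output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment used only for illustration.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
            <SP/>
            <String CONTENT="ALTO" HPOS="70" VPOS="20" WIDTH="40" HEIGHT="12"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
""".strip()


def alto_words(xml_text):
    """Return (content, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        # Drop any "{namespace}" prefix so the ALTO version does not matter.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        # Missing geometry attributes become NaN instead of raising.
        box = tuple(
            float(elem.get(name, "nan"))
            for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        words.append((elem.get("CONTENT"),) + box)
    return words
```

Calling `alto_words(open("page.xml", encoding="utf-8").read())` on a real file gives a quick way to compare the recognized words and their boxes against the source image. It does not replace validation against the ALTO schema.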

@Shreeshrii
Collaborator

@stweil ResultIteratorTest shows that Tesseract can identify superscripts, subscripts, small caps and drop caps.

The ALTO schema seems to support these; see
https://github.com/kermitt2/pdfalto/blob/master/schema/alto.xsd#L762

Does Tesseract's ALTO output include such font styles?

Collaborator

amitdo commented Mar 16, 2022

This feature request was implemented a long time ago.

amitdo closed this as completed Mar 16, 2022

6 participants