Why is ocr_column obsolete? #76

Open
kba opened this issue Oct 22, 2016 · 6 comments

Comments

@kba
Owner

kba commented Oct 22, 2016

No description provided.

@amitdo
Collaborator

amitdo commented Oct 22, 2016

To answer that question, we need to understand ocr_carea...

@zuphilip
Collaborator

... which is also part of issue #28.

Maybe it was argued that columns cannot be style-independent and that there therefore cannot be an ocr_* property for them. For example, in a LaTeX (source) document I see only the text content, and the splitting into several columns is done in the rendering phase. (guesswork^^)

@kba
Owner Author

kba commented Oct 22, 2016

Generally, I'd also favor being careful with document-level semantics, but nesting ocr_carea for everything is also problematic. At the least, it should be easy to differentiate between the print space and the connected boxes (like columns and paragraphs) below that level. ALTO has PrintSpace, ComposedBlock, and TextBlock for these purposes.
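
To illustrate the nesting problem, here is a minimal Python sketch. The hOCR fragment and its bbox values are made up for this example, and the CareaDepth helper is hypothetical, not part of any hOCR tool. It shows that when every level is an ocr_carea, the only thing a consumer can use to tell the print space from a column is the nesting depth.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment: print space and columns are all ocr_carea.
HOCR = """
<div class="ocr_page" title="bbox 0 0 2000 3000">
  <div class="ocr_carea" title="bbox 100 100 1900 2900">
    <div class="ocr_carea" title="bbox 100 100 950 2900">
      <p class="ocr_par" title="bbox 100 100 950 600">...</p>
    </div>
    <div class="ocr_carea" title="bbox 1050 100 1900 2900">
      <p class="ocr_par" title="bbox 1050 100 1900 600">...</p>
    </div>
  </div>
</div>
"""


class CareaDepth(HTMLParser):
    """Record how deeply each ocr_carea is nested inside other ocr_careas.

    Assumes well-formed markup without void elements (no <br>, <img>),
    which holds for the fragment above.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # class attribute of every currently open element
        self.found = []  # (carea nesting depth, title attribute)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_class = attributes.get("class", "")
        self.stack.append(css_class)
        if css_class == "ocr_carea":
            depth = self.stack.count("ocr_carea")
            self.found.append((depth, attributes.get("title", "")))

    def handle_endtag(self, tag):
        self.stack.pop()


parser = CareaDepth()
parser.feed(HOCR)
careas = parser.found
for depth, title in careas:
    print(depth, title)
```

Depth 1 here happens to be the print space and depth 2 the columns, but nothing in the markup says so; a page with one more or one fewer wrapper would shift the meaning of each depth.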

@zuphilip
Collaborator

I agree that we may change these things somehow in the future. I could envision that a page has different (text) content areas, which might be interrupted by a picture or some other special content. Moreover, there can be special (text) content areas such as the footer, the header, or marginalia. A (text) content area can then contain several columns, which may be divided into several text blocks. However, I don't know how "near" this is to what the specs currently say.
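
As a rough sketch of that hierarchy (page, content areas, columns, text blocks), assuming hypothetical class names and role labels that appear in neither hOCR nor ALTO:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextBlock:
    text: str


@dataclass
class Column:
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class ContentArea:
    # Hypothetical role labels: "body", "header", "footer", "marginalia".
    role: str = "body"
    columns: List[Column] = field(default_factory=list)


@dataclass
class Page:
    areas: List[ContentArea] = field(default_factory=list)

    def texts(self, role: str = "body") -> List[str]:
        """Collect block texts of all areas with the given role, in order."""
        return [
            block.text
            for area in self.areas
            if area.role == role
            for column in area.columns
            for block in column.blocks
        ]


page = Page(areas=[
    ContentArea("header", [Column([TextBlock("Running title")])]),
    ContentArea("body", [
        Column([TextBlock("Left column, first block"),
                TextBlock("Left column, second block")]),
        Column([TextBlock("Right column")]),
    ]),
    # A second body area, e.g. the text that continues below a picture.
    ContentArea("body", [Column([TextBlock("Text below the picture")])]),
])

body_texts = page.texts("body")
print(body_texts)
```

Each level has its own type here, so a consumer does not have to infer the role of a box from how deeply it is nested.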

@kba
Owner Author

kba commented Oct 22, 2016

I agree that we may change these things somehow in the future. I could envision that a page has different (text) content areas, which might be interrupted by a picture or some other special content.

That's the intention of cflow/ocr_linear, I think.

@zuphilip
Collaborator

I imagine ocr_linear working the same way as <article> in HTML: everything inside it (maybe excluding some floats) has a reading order, but the ocr_linear elements themselves may not have a canonical reading order among each other. I don't understand the cflow property.


3 participants