This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Reflowable Text / Unwrapping Lines #56

Open
coolwanglu opened this issue Dec 11, 2012 · 12 comments

Comments

@coolwanglu
Owner

PDF was designed for printing, and it has limited support for devices with different screen sizes. HTML goes in the other direction: it originates from a reflowable plain-text stream.

This page shows the difficulties of recognizing and producing reflowable text for pdf2htmlEX. Still, we could focus on the simplest cases and expose suitable parameters to users.

This issue was created for discussion of this feature. Please read the wiki page above before leaving a message here.

@404pnf

404pnf commented Dec 12, 2012

Thank you for trying this!

@coolwanglu
Owner Author

@iapain, I remember that you once showed me a video about reflowing text in PDF, but I forgot its name. Could you please tell me?

@iapain
Collaborator

iapain commented Oct 13, 2013

@coolwanglu
Owner Author

As a start, we can make some assumptions and begin with the easiest case:

No headers, footers, figures, or tables, and a single column of text.

The task is to combine text lines in the same paragraph. We can further extend this to two-column layouts, as text rendering order is usually the same as reading order.
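A minimal sketch of that line-combining step, assuming the lines arrive in reading order with simple bounding boxes. The `Line` fields, the thresholds, and the function names are hypothetical and are not part of pdf2htmlEX; the heuristics (large vertical gap, short previous line, indented first line) are just one plausible starting point, and the de-hyphenation is deliberately naive.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: float    # x of the line's left edge
    right: float   # x of the line's right edge
    top: float     # y of the line's top, growing downward
    height: float  # approximate font size / line height


def join_lines(paragraph):
    """Concatenate the lines of one paragraph into a single string."""
    out = ""
    for line in paragraph:
        text = line.text.strip()
        if out.endswith("-"):
            # Naive de-hyphenation: this also merges real hyphenated compounds.
            out = out[:-1] + text
        elif out:
            out += " " + text
        else:
            out = text
    return out


def merge_lines(lines, column_right, gap_factor=1.5, short_factor=0.85, indent_tol=2.0):
    """Group single-column lines into paragraphs (all thresholds are guesses)."""
    if not lines:
        return []
    column_left = min(line.left for line in lines)
    column_width = column_right - column_left
    paragraphs, current = [], []
    for line in lines:
        if current:
            prev = current[-1]
            big_gap = (line.top - prev.top) > gap_factor * prev.height
            prev_is_short = (prev.right - column_left) < short_factor * column_width
            indented = (line.left - column_left) > indent_tol
            if big_gap or prev_is_short or indented:
                paragraphs.append(current)
                current = []
        current.append(line)
    paragraphs.append(current)
    return [join_lines(p) for p in paragraphs]


sample = [
    Line("PDF was designed for print-", 10, 200, 0, 10),
    Line("ing and has fixed layout.", 10, 150, 12, 10),
    Line("HTML reflows text", 20, 200, 24, 10),
    Line("naturally.", 10, 60, 36, 10),
]
paragraphs = merge_lines(sample, column_right=200)
```

With the sample above, the short second line ends the first paragraph, so `paragraphs` holds two strings. Item lists, centered headings, and justified text with a ragged last line would all need extra rules.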

Spacing might be a problem. I think it's hard to preserve the exact spacing (the widths of letter-spaces, word-spaces, and space characters), so we need to relax the requirement and estimate them.
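One way to estimate instead of preserving exact spacing, sketched under the assumption that each glyph comes with its horizontal extent: treat any gap wider than a fraction of the font size as a word space and drop the rest. The function name and the `space_ratio` default of 0.25 are hypothetical, not values taken from pdf2htmlEX.

```python
def rebuild_text(glyphs, font_size, space_ratio=0.25):
    """Rebuild a line's text from (char, x0, x1) tuples sorted by x0.

    A gap larger than space_ratio * font_size becomes one space character;
    smaller gaps (kerning, letter-spacing) are ignored.
    """
    out = []
    prev_x1 = None
    for ch, x0, x1 in glyphs:
        if prev_x1 is not None and (x0 - prev_x1) > space_ratio * font_size:
            out.append(" ")
        out.append(ch)
        prev_x1 = x1
    return "".join(out)


glyphs = [("a", 0, 5), ("b", 5.5, 10), ("c", 14, 19), ("d", 19.2, 24)]
line_text = rebuild_text(glyphs, font_size=10)  # gaps: 0.5, 4.0, 0.2
```

Here only the 4.0-unit gap exceeds the 2.5-unit threshold, so the line comes out as `"ab cd"`. Text set with wide letter-spacing would need a threshold derived from the observed gap distribution rather than a fixed ratio.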

The process should be easy if facilitated with manual marking/adjustments.

Ref: http://scribdtech.wordpress.com/2012/02/29/why-zooming-on-mobile-is-broken-and-how-to-fix-it/

Another direction is to tag the PDF in some way and then use the tag information in the PDF file.

@coolwanglu
Owner Author

Lots of useful information here:
http://wiki.mobileread.com/wiki/PDF

@alvis

alvis commented Oct 30, 2013

For text reflowing, you may consider the approach of K2pdfopt. It is an open-source tool that converts a PDF file to a different page size with reflowed text.

http://www.willus.com/k2pdfopt

@coolwanglu
Owner Author

@alvisty Thanks for your information!

It seems to be actively developed, and it looks promising. I'll check it out and see how it works.

@alvis

alvis commented Oct 30, 2013

An Indian PDF converter (Aiox) seems to be able to convert documents to ePub perfectly (with OCR and some manual input). https://www.youtube.com/watch?v=qC1hwJ8KFL8

InftyReader is even able to convert math equations to LaTeX/MathML format.
https://www.youtube.com/watch?v=PHDZEjwWjx0
http://www.inftyproject.org/en/demo.html#0002

Unfortunately, both of them are proprietary, so their sources are unavailable.
Yet, at least technically speaking, complete reflowing (even with math and tables) is achievable.
Also, both of them use OCR as input. It helps identify the relationships between pieces of text.

@coolwanglu
Owner Author

I wouldn't say "reflowing is achievable" based on these videos. In the first video, for example, the page box is (or maybe has to be) drawn manually. The document seems to be well formed and single-column, the paragraphs are neatly aligned (no item lists etc.), and I didn't see any tables in the video.

But I do believe that it's doable with some assumptions on the document type and format.

The OCR example is indeed amazing, but I couldn't try my own files due to maintenance.

@coolwanglu
Owner Author

@iapain k2pdfopt seems to be image based; I'll try to read more of its code later. For now I'll try to create a new text model that can support reflowable text in the future. The new model will be somewhat like Crocodoc's, which consists of text groups containing relatively positioned text lines.
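To make the idea of "text groups with relatively positioned text lines" concrete, here is a small sketch of such a model. It is not Crocodoc's actual format or pdf2htmlEX code; the class names, fields, and the helper `group_from_absolute` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TextLine:
    text: str
    dx: float  # horizontal offset from the group's left edge
    dy: float  # vertical offset from the group's top edge


@dataclass
class TextGroup:
    x: float  # group origin in page coordinates
    y: float
    lines: list = field(default_factory=list)

    def absolute_positions(self):
        """Resolve every line back to page coordinates."""
        return [(self.x + l.dx, self.y + l.dy, l.text) for l in self.lines]


def group_from_absolute(lines):
    """Build a group from (x, y, text) tuples, anchored at their top-left corner."""
    x0 = min(x for x, _, _ in lines)
    y0 = min(y for _, y, _ in lines)
    return TextGroup(x0, y0, [TextLine(t, x - x0, y - y0) for x, y, t in lines])


group = group_from_absolute([(72, 100, "First line"), (60, 112, "second line")])
group.x = 10  # moving the group moves all of its lines together
moved = group.absolute_positions()
```

Because the lines store only offsets, a whole group can be moved or re-laid-out by changing its origin, which is the property a reflow step would rely on.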

@matt212

matt212 commented May 22, 2018

Any update on this? My objective is the same as #46:
I just want structured HTML, no CSS!

@sadovnychyi

Is anyone interested in working on this for a bounty? I'm looking for a solution where the resulting HTML contains proper paragraph structure, center/right-aligned elements are positioned with the respective CSS attributes, etc.; in short, no absolutely positioned elements, which seems to be what reflowable text is about.
I'm willing to pay for this issue to be resolved, and I'm pretty sure we could all benefit from that.

ping @coolwanglu
