Non-text extraction

elacin edited this page Sep 14, 2010 · 1 revision

It would be useful to extract more from a PDF than just its text:

Drawings / Lines

If drawn lines could be extracted, they would be very useful for recognizing figures. Otherwise a figure just yields several fairly random ParagraphNodes, whose content would best be separated entirely from the running text.

To accomplish this, something has to be done about the current superclass of the Pdf2Xml class, which does not implement all the necessary PDF operators. Other superclasses are available (PageDrawer, for example, contains this functionality), but the functionality of these classes does not entirely overlap. Either some merging of code (most likely into a new class in this repository) or an implementation of the missing operators would be necessary.
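
As a rough sketch of how extracted lines could be used once they are available: take the bounding box of all drawn segments as the figure region, and drop text boxes whose centre falls inside it. Everything here is hypothetical. `FigureRegionSketch`, `figureBounds` and `runningText` are illustrative names that do not exist in this repository, the rectangles stand in for the positions of ParagraphNodes, and treating every drawn line on a page as part of one figure is a simplifying assumption.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of this repository. */
final class FigureRegionSketch {

    /** Union of the bounding boxes of all drawn segments, or null if there are none. */
    static Rectangle2D figureBounds(List<Line2D> segments) {
        Rectangle2D bounds = null;
        for (Line2D segment : segments) {
            Rectangle2D box = segment.getBounds2D();
            bounds = (bounds == null) ? box : bounds.createUnion(box);
        }
        return bounds;
    }

    /**
     * Keeps only the text boxes whose centre lies outside the figure region.
     * Assumption: each rectangle is the position of one paragraph on the page.
     */
    static List<Rectangle2D> runningText(List<Rectangle2D> textBoxes, Rectangle2D figure) {
        List<Rectangle2D> kept = new ArrayList<>();
        for (Rectangle2D box : textBoxes) {
            boolean inside = figure != null
                    && figure.contains(box.getCenterX(), box.getCenterY());
            if (!inside) {
                kept.add(box);
            }
        }
        return kept;
    }
}
```

A real implementation would probably need to cluster lines into several figures per page and ignore rules that belong to the layout, such as table borders or underlines.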

Incomplete embedded fonts

The full content of text written with a custom encoding can only be retrieved exactly by means of OCR. Knowing when this happens, and for which portions of the text, would be very useful.
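
One possible way to flag such portions, sketched below, is to measure how much of an extracted string consists of code points that rarely appear in ordinary text: private-use, unassigned, the replacement character U+FFFD, and non-whitespace control characters. This is only a heuristic and an assumption on my part, not something the extractor does today. It will miss fonts whose custom encoding maps to plausible but wrong letters. `SuspectTextSketch` and its methods are hypothetical names, and the threshold is arbitrary.

```java
/** Hypothetical helper, not part of this repository. */
final class SuspectTextSketch {

    /** True for code points that suggest the font encoding could not be mapped to Unicode. */
    static boolean isSuspect(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || codePoint == 0xFFFD
                || (type == Character.CONTROL && !Character.isWhitespace(codePoint));
    }

    /** Fraction of code points in the text that look suspect (0.0 for empty text). */
    static double suspectRatio(String text) {
        long total = text.codePoints().count();
        if (total == 0) {
            return 0.0;
        }
        long suspect = text.codePoints().filter(SuspectTextSketch::isSuspect).count();
        return (double) suspect / total;
    }

    /** Flags a text portion for OCR when the suspect ratio exceeds the given threshold. */
    static boolean needsOcr(String text, double threshold) {
        return suspectRatio(text) > threshold;
    }
}
```

Applied per paragraph or per font run, for example `SuspectTextSketch.needsOcr(paragraphText, 0.3)`, this would give a coarse map of which portions of a page might need OCR.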

Extraction of Embedded images

This could be useful for some export targets, for example a PDF-to-ODF converter. Won't spend any time on this for now.

Anything else?