Non-text extraction
It would be useful to extract more from a PDF than just its text:
If drawn lines could be extracted, they would be very useful for recognizing figures. At present a figure just yields several fairly random ParagraphNodes, whose content would best be kept entirely separate from the running text.
To accomplish this, something has to be done about the current superclass of the Pdf2Xml class, which does not implement all the necessary PDF operators. Other superclasses are available (PageDrawer, for example, contains this functionality), but their functionality does not entirely overlap. Either some merging of code (most likely into a new class contained in this repository) or an implementation of the missing operators would be necessary.
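To give an idea of what "implementing the needed operators" could involve, here is a minimal sketch that collects stroked straight lines from a content stream fragment. It is plain Java and does not touch Pdf2Xml or any PDFBox class; `StrokedLineCollector`, `Segment` and `collect` are hypothetical names made up for this illustration. It assumes a well-formed, whitespace-separated fragment that contains only numeric operands and operators (no text strings), handles only the path operators `m`, `l`, `re`, `h`, `S`, `s`, `n`, `f` and `f*`, and ignores the transformation matrix, curves and clipping.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper for illustration; not part of this repository or of PDFBox. */
final class StrokedLineCollector {

    /** A straight segment in untransformed user-space coordinates. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /**
     * Collects stroked segments from a whitespace-separated content stream fragment.
     * Assumes well-formed input; a missing operand raises NoSuchElementException.
     */
    static List<Segment> collect(String contentStream) {
        List<Segment> stroked = new ArrayList<>();
        List<Segment> pending = new ArrayList<>(); // current path, not yet painted
        Deque<Double> operands = new ArrayDeque<>();
        double startX = 0, startY = 0, curX = 0, curY = 0;

        for (String token : contentStream.trim().split("\\s+")) {
            switch (token) {
                case "m": // x y m: begin a new subpath
                    curY = operands.pop();
                    curX = operands.pop();
                    startX = curX;
                    startY = curY;
                    break;
                case "l": { // x y l: straight line from the current point
                    double y = operands.pop();
                    double x = operands.pop();
                    pending.add(new Segment(curX, curY, x, y));
                    curX = x;
                    curY = y;
                    break;
                }
                case "re": { // x y w h re: closed rectangle, four segments
                    double h = operands.pop(), w = operands.pop();
                    double y = operands.pop(), x = operands.pop();
                    pending.add(new Segment(x, y, x + w, y));
                    pending.add(new Segment(x + w, y, x + w, y + h));
                    pending.add(new Segment(x + w, y + h, x, y + h));
                    pending.add(new Segment(x, y + h, x, y));
                    startX = curX = x;
                    startY = curY = y;
                    break;
                }
                case "h": // close the subpath
                case "s": // close the subpath, then stroke
                    if (curX != startX || curY != startY) {
                        pending.add(new Segment(curX, curY, startX, startY));
                        curX = startX;
                        curY = startY;
                    }
                    if (token.equals("h")) {
                        break;
                    }
                    // "s" falls through to the stroke handling below
                case "S": // stroke: the pending segments are visible lines
                    stroked.addAll(pending);
                    pending.clear();
                    break;
                case "n":  // end the path without painting (used for clipping)
                case "f":  // fill only, nothing is stroked
                case "f*":
                    pending.clear();
                    break;
                default:
                    try {
                        operands.push(Double.parseDouble(token));
                    } catch (NumberFormatException e) {
                        operands.clear(); // unsupported operator: drop its operands
                    }
            }
        }
        return stroked;
    }
}
```

For example, `StrokedLineCollector.collect("10 20 m 110 20 l S")` returns a single segment from (10, 20) to (110, 20). A real implementation would also have to apply the current transformation matrix before comparing line positions with text positions, which is probably where reusing code from an existing superclass pays off.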
The full content of text written with a custom encoding can only be retrieved exactly by means of OCR. Knowing when this happens, and for which portions of the text, would be very useful.
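One rough way to flag such portions is to look at the extracted characters themselves. The sketch below is only a heuristic under my own assumptions, not something this project does today. It treats private-use, unassigned, surrogate and non-whitespace control code points, plus U+FFFD, as a sign that the font's encoding could not be mapped to Unicode. `UnmappedTextDetector`, `suspectRatio`, `needsOcr` and the threshold are hypothetical. It cannot catch a custom encoding that maps glyphs to plausible but wrong letters; for that, font-level information (for instance whether the font carries a ToUnicode CMap) would have to be inspected as well.

```java
/** Hypothetical heuristic for illustration; not part of this repository or of PDFBox. */
final class UnmappedTextDetector {

    /** Returns the fraction of code points in the run that carry no usable Unicode meaning. */
    static double suspectRatio(String run) {
        int total = 0;
        int suspect = 0;
        for (int i = 0; i < run.length(); ) {
            int cp = run.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            int type = Character.getType(cp);
            boolean unmapped = cp == 0xFFFD                 // replacement character
                    || type == Character.PRIVATE_USE       // e.g. U+E000..U+F8FF
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE          // lone surrogate
                    || (type == Character.CONTROL && !Character.isWhitespace(cp));
            if (unmapped) {
                suspect++;
            }
        }
        return total == 0 ? 0.0 : (double) suspect / total;
    }

    /** Flags a text run whose suspect share exceeds the caller-chosen threshold. */
    static boolean needsOcr(String run, double threshold) {
        return suspectRatio(run) > threshold;
    }
}
```

Calling `needsOcr` once per paragraph or text run, rather than once per page, would give the per-portion information asked for above. The threshold would have to be tuned on real documents.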
This could be useful for some export targets, for example a PDF-to-ODF converter. I won't spend any time on this now.