Support for Extracting PDF Content as XML #35

coroluca · 2024-11-23T20:28:35Z

Hi, I’d like to use Extractous for my document processing tasks. I often need to extract PDF content as XML to retain structural information, such as page boundaries. This is a feature supported by Apache Tika, but it seems that currently, Extractous only provides plain text extraction.

Would it be possible to add support for XML extraction, similar to Tika’s functionality? This feature would be incredibly useful for preserving document structure.
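
To illustrate what I mean by page boundaries: Tika's XHTML output for a PDF wraps each page in a `<div class="page">` element, so the pages can be recovered with a standard XML parser. A minimal sketch, assuming output of that shape; the `SAMPLE` string is hand-written for illustration (not produced by a real run) and `split_pages` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Namespace used by XHTML documents such as Tika's XML output.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hand-written sample in the shape of Tika-style XHTML for a two-page PDF.
# This is an assumption for illustration, not real extractor output.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First page text.</p></div>
<div class="page"><p>Second page,</p><p>two paragraphs.</p></div>
</body>
</html>"""


def split_pages(xhtml: str) -> list[str]:
    """Return the text of each <div class="page"> element, in document order."""
    root = ET.fromstring(xhtml)
    pages = []
    for div in root.iter(f"{XHTML_NS}div"):
        if div.get("class") == "page":
            # itertext() yields every nested text node; collapse whitespace.
            text = " ".join(" ".join(div.itertext()).split())
            pages.append(text)
    return pages


if __name__ == "__main__":
    for number, text in enumerate(split_pages(SAMPLE), start=1):
        print(number, text)
```

With plain text extraction this per-page split is lost, which is why XML output matters for my use case.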

Thank you for considering this request!

nmammeri · 2024-11-25T17:13:58Z

Thanks for reporting this. We are working on it and will update this issue when we have a working implementation.

davidmezzetti · 2024-12-04T13:12:04Z

Thanks for mentioning this project to me over on Reddit. I'll definitely consider integrating it into txtai as another text extraction engine once this change is in.

I've long thought Tika is a good solution but the Java piece trips a lot of people up.

nmammeri · 2024-12-04T13:42:20Z

Thanks @davidmezzetti, we can definitely assist with the integration. Integrations with other frameworks such as txtai have always been part of our plan. At the moment we are focusing on supporting the most commonly expected Tika features (including XML output). Then we can move on to integrations.
We'll get in touch once this is in.

davidmezzetti · 2024-12-04T20:22:24Z

Sounds good. I should be able to add an integration fairly easily, ~20-30 lines with txtai once you have this change. I'll keep an eye on this!

coroluca · 2024-12-18T09:05:15Z

Hi @nmammeri,
do you have any updates on the progress of XML extraction support?
Thanks again for your work on this!

nmammeri · 2024-12-21T10:18:48Z

Hi @davidmezzetti and @coroluca, I'm glad to announce that we finally got the XML output feature in. Please check version 0.3.0 🎉. Thanks for your patience.

davidmezzetti · 2024-12-21T10:43:47Z

Great work! I'll take a look.

nmammeri added the enhancement (New feature or request) label · Nov 25, 2024
