Support for Extracting PDF Content as XML #35
Comments
Thanks for reporting this. We are working on it and will update this issue when we have a working implementation.
Thanks for mentioning this project to me over on Reddit. I'll definitely consider integrating it into txtai as another text extraction engine once this change is in. I've long thought Tika is a good solution but the Java piece trips a lot of people up.
Thanks @davidmezzetti, we can definitely assist with the integration. It was always in our plan to work on integrations with other frameworks such as txtai. At the moment we are focusing on supporting the most commonly expected Tika features (including XML output). Then we can move on to integrations.
Sounds good. I should be able to add an integration fairly easily, ~20-30 lines with txtai once you have this change. I'll keep an eye on this!
Hi @nmammeri,
Hi @davidmezzetti and @coroluca, I'm glad to announce that we finally got the XML output feature in. Please check version 0.3.0 🎉. Thanks for your patience.
Great work! I'll take a look.
Hi, I’d like to use Extractous for my document processing tasks. I often need to extract PDF content as XML to retain structural information, such as page boundaries. This is a feature supported by Apache Tika, but it seems that currently, Extractous only provides plain text extraction.
Would it be possible to add support for XML extraction, similar to Tika’s functionality? This feature would be incredibly useful for preserving document structure.
Thank you for considering this request!
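To illustrate why XML output helps with page boundaries, here is a minimal sketch of the kind of post-processing I have in mind. It assumes the XML follows the XHTML layout Tika produces for PDFs, where each page is wrapped in a `<div class="page">` element. The sample string and the `split_pages` helper are hypothetical and only for illustration; the sketch uses the Python standard library and does not call Extractous or Tika.

```python
import xml.etree.ElementTree as ET

# XHTML namespace used by Tika-style output (assumed for this sketch).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample shaped like Tika's XHTML output for a two-page PDF.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First page text.</p></div>
<div class="page"><p>Second page,</p><p>two paragraphs.</p></div>
</body>
</html>"""


def split_pages(xhtml: str) -> list[str]:
    """Return the text of each <div class="page"> element, in document order."""
    root = ET.fromstring(xhtml)
    pages = []
    for div in root.iter(f"{XHTML_NS}div"):
        if div.get("class") == "page":
            # itertext() yields every nested text node; collapse whitespace runs.
            text = " ".join(" ".join(div.itertext()).split())
            pages.append(text)
    return pages


pages = split_pages(SAMPLE_XHTML)
for number, text in enumerate(pages, start=1):
    print(f"page {number}: {text}")
```

With plain text output this per-page split is not recoverable, which is the reason for the request.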