Extract tables from PDF to CSV using Tabula #2312
- Add the tabula-java dependency to extract tables into CSV.
- Exclude slf4j-simple from it, since Logback is used for logging.
- Add FlexibleCSVWriter, which extends CSVWriter so that a custom CSVFormat can be passed; CSVWriter's parameterized constructor (the one that allows changing the CSVFormat) is protected.
- Use Tabula to extract tables from PDFs instead of the existing implementation.
- Delete PDFTableStripper, which is no longer needed now that tabula-java is used instead.
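The FlexibleCSVWriter bullet above relies on a common pattern: a subclass with a public constructor that forwards to the parent's protected one. The following is a minimal, self-contained sketch of that pattern only. `SimpleFormat`, `BaseCsvWriter`, and `FlexibleWriter` are hypothetical stand-ins invented for illustration; they are not tabula-java's CSVWriter, Commons CSV's CSVFormat, or the FlexibleCSVWriter added in this PR.

```java
import java.util.List;

// Hypothetical stand-in for a library format type (only carries a delimiter).
final class SimpleFormat {
    final char delimiter;

    SimpleFormat(char delimiter) {
        this.delimiter = delimiter;
    }
}

// Hypothetical stand-in for a library writer: the no-arg constructor is public,
// but the constructor that accepts a custom format is protected.
class BaseCsvWriter {
    private final SimpleFormat format;

    public BaseCsvWriter() {
        this(new SimpleFormat(','));
    }

    protected BaseCsvWriter(SimpleFormat format) {
        this.format = format;
    }

    // Joins each row's cells with the configured delimiter, one row per line.
    // No quoting or escaping is done; this is deliberately simplified.
    public String render(List<List<String>> rows) {
        StringBuilder out = new StringBuilder();
        for (List<String> row : rows) {
            out.append(String.join(String.valueOf(format.delimiter), row)).append('\n');
        }
        return out.toString();
    }
}

// The subclass adds no behavior; it only widens access to the protected constructor.
// Note: inside the same package the protected constructor is reachable directly,
// so the subclass matters for callers that live in a different package.
class FlexibleWriter extends BaseCsvWriter {
    public FlexibleWriter(SimpleFormat format) {
        super(format);
    }
}
```

With these stand-ins, `new FlexibleWriter(new SimpleFormat(';')).render(rows)` joins cells with semicolons, while `new BaseCsvWriter().render(rows)` keeps the default comma.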
Can you add and see if it still works?
Also, did this improve PDF-to-CSV conversions? Any ideas?
Sure, I will give it a try and let you know afterwards.
When I tested it, I believe it improved the conversion. The old version worked correctly for some PDF files and failed on others: it would either extract the entire table and place it on one row (column by column), or extract only the first 4 values of the first column and also place them on a single line. I tried Tabula on multiple files, including the ones the old version failed on, and it worked correctly.
I added the excluded dependencies and added gson as you mentioned,
but this caused testMainApplicationStartup() in stirling.software.SPDF.SPDFApplicationTest to fail with the following exception
Anyway, I decided to try running the application without
and it worked just as well as before, so I think this way we avoid the security vulnerabilities mentioned earlier. (I will commit and push it.) If merged, I believe this might also close #2255 (once the remote/demo server is updated). Separately, I found a bug in the UI when converting from PDF to CSV, but I believe I should open a separate issue for it.
- Exclude gson and bcprov-jdk15on from tabula-java due to detected security vulnerabilities.
Description
Closes #1614
Checklist