Extract tables from PDF to CSV using Tabula #2312

omar-ahmed42 · 2024-11-23T14:32:39Z

Description

Add the Tabula dependency (tabula-java) for extracting tables from PDF files.
Exclude slf4j-simple from the tabula-java dependency, as it caused errors in the project when used alongside Logback.
Add FlexibleCSVWriter to allow different CSVFormat rules, such as quoting all values (e.g. "1", "my value", "my 2nd column value"). CSVWriter's parameterized constructor is protected, so the CSVFormat could not be customized otherwise.
Use Tabula to extract each table and its values and write them in CSV format.
Delete PDFTableStripper, as it is no longer needed (Tabula is sufficient).
Use the correct class for the logger in ExtractCSVController (CropController.class -> ExtractCSVController.class).
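
The FlexibleCSVWriter item above relies on a common Java pattern: a subclass re-exposes a protected constructor of its parent as a public one. The following is a minimal, self-contained sketch of that pattern and is not the code from this PR. BaseCsvWriter, FlexibleCsvWriterSketch, and the boolean quoteAll flag are hypothetical stand-ins for Tabula's CSVWriter, the PR's FlexibleCSVWriter, and a CSVFormat quoting rule, so that the example runs without any dependencies.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a library writer whose configurable
// constructor is protected and therefore not callable by outside code.
class BaseCsvWriter {
    protected final boolean quoteAll;

    // The only public constructor uses the default rule (quote when needed).
    public BaseCsvWriter() {
        this(false);
    }

    // Callers outside the class hierarchy (and package) cannot reach this one.
    protected BaseCsvWriter(boolean quoteAll) {
        this.quoteAll = quoteAll;
    }

    public String writeRow(List<String> cells) {
        return cells.stream().map(this::format).collect(Collectors.joining(","));
    }

    private String format(String cell) {
        boolean needsQuotes = quoteAll
                || cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        if (!needsQuotes) {
            return cell;
        }
        // Embedded quotes are doubled, as is usual in CSV.
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }
}

// Hypothetical subclass: its public constructor forwards to the protected one.
class FlexibleCsvWriterSketch extends BaseCsvWriter {
    public FlexibleCsvWriterSketch(boolean quoteAll) {
        super(quoteAll);
    }
}

class FlexibleCsvWriterDemo {
    public static void main(String[] args) {
        List<String> row = List.of("1", "my value", "my 2nd column value");
        // Default rule: 1,my value,my 2nd column value
        System.out.println(new BaseCsvWriter().writeRow(row));
        // Quote-all rule: "1","my value","my 2nd column value"
        System.out.println(new FlexibleCsvWriterSketch(true).writeRow(row));
    }
}
```

Subclassing keeps the parent's writing logic intact and only widens access to the configuration, which is presumably why the PR extends CSVWriter rather than reimplementing it.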

Checklist

I have read the Contribution Guidelines
I have performed a self-review of my own code
I have attached images of the change if it is UI based
I have commented my code, particularly in hard-to-understand areas
If my code has heavily changed functionality I have updated relevant docs on Stirling-PDFs doc repo
My changes generate no new warnings
I have read the section Add New Translation Tags (for new translation tags only)


Frooodle · 2024-11-23T22:11:03Z

hmmm need to think about this a lil more

Frooodle · 2024-11-23T22:13:15Z

Can you add
implementation ('technology.tabula:tabula:1.0.5') {
    exclude group: "org.slf4j", module: "slf4j-simple"
    exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
}
implementation 'com.google.code.gson:gson:2.8.9'

and see if it still works?

Frooodle · 2024-11-23T22:13:27Z

also did this improve PDF to CSV conversions? any ideas?

omar-ahmed42 · 2024-11-23T22:23:24Z

> Can you add
> implementation ('technology.tabula:tabula:1.0.5') {
>     exclude group: "org.slf4j", module: "slf4j-simple"
>     exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
> }
> implementation 'com.google.code.gson:gson:2.8.9'
>
> and see if it still works?

Sure, I will give it a try and inform you afterwards

> also did this improve PDF to CSV conversions? any ideas?

When I tested it, I believe it improved the conversion. The old version worked correctly for some PDF files and failed on others: it would either extract the entire table and place it on one row (column by column), or extract only the first 4 values of the first column and place them on the same line. With Tabula, I tried multiple files, including the ones the old version failed on, and it worked correctly.

omar-ahmed42 · 2024-11-23T23:13:57Z

I added the excluded dependencies and added gson as you mentioned

implementation ('technology.tabula:tabula:1.0.5') {
    exclude group: "org.slf4j", module: "slf4j-simple"
    exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
}
implementation 'com.google.code.gson:gson:2.8.9'

However, this caused testMainApplicationStartup() in stirling.software.SPDF.SPDFApplicationTest to fail with the following exception: "java.lang.IllegalStateException: java.lang.NoClassDefFoundError: com/google/gson/Strictness" (I can provide the full stack trace if required).
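
A NoClassDefFoundError like this one usually means that some code on the classpath was compiled against a class that the resolved version of a library does not contain. One plausible reading, stated here as an assumption rather than a confirmed diagnosis, is that pinning Gson to 2.8.9 removes com.google.gson.Strictness, which exists only in newer Gson releases. A small probe along the following lines can confirm whether a class is visible at runtime; ClasspathProbe is a hypothetical helper and is not part of this PR.

```java
// Hypothetical diagnostic helper: reports whether a class can be loaded
// from the classpath this code is running on.
class ClasspathProbe {

    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate and load the class, do not run
            // its static initializers.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Treat both "not found" and "found but not linkable" as absent.
            return false;
        }
    }

    public static void main(String[] args) {
        // The result depends entirely on which Gson jar (if any) was resolved.
        System.out.println("com.google.gson.Gson present: "
                + isPresent("com.google.gson.Gson"));
        System.out.println("com.google.gson.Strictness present: "
                + isPresent("com.google.gson.Strictness"));
    }
}
```

Run inside the application or a test, this shows whether the Gson version on the classpath includes the class named in the stack trace.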

I then ran the application without implementation 'com.google.code.gson:gson:2.8.9' and it worked well. Afterwards I decided to exclude gson from tabula as well, as follows:

implementation ('technology.tabula:tabula:1.0.5') {
    exclude group: "org.slf4j", module: "slf4j-simple"
    exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
    exclude group: "com.google.code.gson", module: "gson"
}

This also worked just as well as before. I think this way we avoid the security vulnerabilities mentioned earlier. (I will commit and push it.)

If merged, I believe this might also close #2255 (once the remote/demo server is updated).

Separately, I found a bug in the UI when converting from PDF to CSV, but I believe that needs its own issue.


omar-ahmed42 added 5 commits November 23, 2024 15:02

Add Tabula dependency and exclude slf4j-simple

a3cd3d0

- Add tabula-java dependency to extract tables into CSV.
- Exclude slf4j-simple due to Logback

Add a flexible CSVWriter

a47296f

- Add FlexibleCSVWriter which extends CSVWriter to pass a custom CSVFormat, as CSVWriter's parameterized constructor (that allows changing CSVFormat) is protected.

Use Tabula in extracting tables from PDF

5565dbe

- Use Tabula in extracting tables from PDF instead of the existing implementation

Delete PDFTableStripper as It is unneeded

930e072

- Delete PDFTableStripper as It is unneeded as Tabula-Java is used instead.

Use correct class in ExtractCSVController logger

441ab2e

omar-ahmed42 requested a review from Frooodle as a code owner November 23, 2024 14:32

dosubot bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) Nov 23, 2024

github-actions bot added the Java, Back End, and API labels Nov 23, 2024

dosubot bot added the enhancement label Nov 23, 2024

Exclude gson and bcprov-jdk15on dependencies from tabula

3459a29

- Exclude gson and bcprov-jdk15on from tabula-java due to detected security vulnerabilities.

Frooodle merged commit afad06b into Stirling-Tools:main Nov 23, 2024
6 checks passed
