Extract tables from PDF to CSV using Tabula #2312

Merged

Conversation

omar-ahmed42
Contributor

Description

  • Add the Tabula dependency for extracting tables from PDF files.
  • Exclude slf4j-simple from the tabula-java dependency, as it caused errors in the project when used alongside Logback.
  • Add FlexibleCSVWriter to allow different CSVFormat rules, such as quoting all values (e.g. "1", "my value", "my 2nd column value"). CSVWriter's parameterized constructor is protected, so the CSVFormat could not otherwise be customized.
  • Use Tabula to extract each table and its values and write them in CSV format.
  • Delete PDFTableStripper, as it is no longer needed (Tabula is sufficient).
  • Use the correct class for the logger in ExtractCSVController (ExtractCSVController.class instead of CropController.class).
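
The FlexibleCSVWriter item relies on a common pattern: a subclass re-exposes a protected constructor as public so callers can pass their own configuration. The sketch below illustrates only that pattern with standard-library Java. `BaseCsvWriter`, `FlexibleWriter`, and the `quoteAll` flag are hypothetical stand-ins, not Tabula's CSVWriter or the PR's actual class, which passes a CSVFormat instead of a boolean.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FlexibleWriterSketch {

    /** Hypothetical stand-in for a library writer whose configurable constructor is protected. */
    static class BaseCsvWriter {
        private final boolean quoteAll;

        public BaseCsvWriter() {
            this(false); // default: quote a value only when it needs quoting
        }

        // In a real library this class lives in another package,
        // so only subclasses can reach this constructor.
        protected BaseCsvWriter(boolean quoteAll) {
            this.quoteAll = quoteAll;
        }

        public String writeRow(List<String> values) {
            return values.stream().map(this::format).collect(Collectors.joining(","));
        }

        private String format(String value) {
            boolean needsQuotes = quoteAll
                    || value.contains(",") || value.contains("\"") || value.contains("\n");
            if (!needsQuotes) {
                return value;
            }
            // Embedded double quotes are escaped by doubling them.
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
    }

    /** Subclass that re-exposes the protected constructor as public. */
    static class FlexibleWriter extends BaseCsvWriter {
        public FlexibleWriter(boolean quoteAll) {
            super(quoteAll);
        }
    }

    public static void main(String[] args) {
        List<String> row = List.of("1", "my value", "my 2nd column value");
        System.out.println(new BaseCsvWriter().writeRow(row));     // 1,my value,my 2nd column value
        System.out.println(new FlexibleWriter(true).writeRow(row)); // "1","my value","my 2nd column value"
    }
}
```

The subclass adds no behavior of its own; its only job is to widen constructor visibility, which keeps the change small and leaves the library class untouched.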

Closes #1614

Checklist

  • I have read the Contribution Guidelines
  • I have performed a self-review of my own code
  • I have attached images of the change if it is UI based
  • I have commented my code, particularly in hard-to-understand areas
  • If my code has heavily changed functionality I have updated relevant docs on Stirling-PDFs doc repo
  • My changes generate no new warnings
  • I have read the section Add New Translation Tags (for new translation tags only)

- Add tabula-java dependency to extract tables into CSV.
- Exclude slf4j-simple due to Logback
- Add FlexibleCSVWriter which extends CSVWriter to pass a custom CSVFormat, as CSVWriter's parameterized constructor (that allows changing CSVFormat) is protected.
- Use Tabula in extracting tables from PDF instead of the existing implementation
- Delete PDFTableStripper, as it is unneeded now that Tabula-Java is used instead.
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Nov 23, 2024
@github-actions github-actions bot added Java Pull requests that update Java code Back End Issues related to back-end development API API-related issues or pull requests labels Nov 23, 2024
@dosubot dosubot bot added the enhancement New feature or request label Nov 23, 2024
@Frooodle
Member

hmmm need to think about this a lil more
image

@Frooodle
Member

Can you add
implementation ('technology.tabula:tabula:1.0.5') {
    exclude group: "org.slf4j", module: "slf4j-simple"
    exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
}
implementation 'com.google.code.gson:gson:2.8.9'

and see if it still works?

@Frooodle
Member

also did this improve PDF to CSV conversions? any ideas?

@omar-ahmed42
Contributor Author

> Can you add
> implementation ('technology.tabula:tabula:1.0.5') {
>     exclude group: "org.slf4j", module: "slf4j-simple"
>     exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
> }
> implementation 'com.google.code.gson:gson:2.8.9'
>
> and see if it still works?

Sure, I will give it a try and let you know afterwards.

> also did this improve PDF to CSV conversions? any ideas?

In my testing it improved the conversion. The old version worked correctly for some PDF files and failed on others: it would either extract the entire table and place it on one row, column by column, or extract only the first 4 values of the first column and place them on the same line. I tried Tabula on multiple files, including the ones the old version failed on, and it extracted them correctly.

@omar-ahmed42
Contributor Author

I added the excluded dependencies and added gson as you mentioned

implementation ('technology.tabula:tabula:1.0.5') {
    exclude group: "org.slf4j", module: "slf4j-simple"
    exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
}
implementation 'com.google.code.gson:gson:2.8.9'

but this caused testMainApplicationStartup() in stirling.software.SPDF.SPDFApplicationTest to fail with the following exception "java.lang.IllegalStateException: java.lang.NoClassDefFoundError: com/google/gson/Strictness" (I could provide you with the full stack trace if required).
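
A NoClassDefFoundError of this kind generally means a class that some code was compiled against is missing from the runtime classpath, which suggests (as an assumption, not something confirmed in this thread) that the pinned gson 2.8.9 replaced a newer gson that contains `com.google.gson.Strictness`. A minimal probe, using only the standard library, can check whether a given class is visible at runtime. `ClasspathProbe` and `isPresent` are hypothetical names introduced for illustration.

```java
public class ClasspathProbe {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isPresent(String className) {
        try {
            // 'false' skips static initialization; only visibility is checked.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Always present in any JVM.
        System.out.println(isPresent("java.lang.String"));
        // Result depends on which gson version, if any, is on the classpath.
        System.out.println(isPresent("com.google.gson.Strictness"));
    }
}
```

Running the probe inside the application before and after changing the dependency block would show whether the class from the exception message is actually available.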

So I ran the application without implementation 'com.google.code.gson:gson:2.8.9' and it worked well. Afterwards I decided to exclude gson from tabula as well, as follows:

implementation ('technology.tabula:tabula:1.0.5') {
    exclude group: "org.slf4j", module: "slf4j-simple"
    exclude group: "org.bouncycastle", module: "bcprov-jdk15on"
    exclude group: "com.google.code.gson", module: "gson"
}

It also worked just as well as before. I guess this way we avoid the security vulnerabilities mentioned earlier. (I will commit and push it.)

If merged, I believe this might also close #2255 (once the remote/demo server is updated).

Separately, I found a bug in the UI when converting from PDF to CSV, but I believe that needs its own issue.

- Exclude gson and bcprov-jdk15on from tabula-java due to detected security vulnerabilities.
@Frooodle Frooodle merged commit afad06b into Stirling-Tools:main Nov 23, 2024
6 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature Request]: enhance CSV excel conversion
2 participants