PDFTextStripper.getText misses some characters from a PDF file #519

Open
jamiehiggins opened this issue Jun 27, 2023 · 2 comments
Labels
type: bug (Existing feature doesn't work correctly)

Comments

jamiehiggins commented Jun 27, 2023

When attempting to extract text from the attached simple PDF file there are some characters missing within the text.

To reproduce the problem, simply call pdfStripper.getText() on the attached PDF file (Problematic.pdf).

Most of the text is returned correctly, but the following issues are present in the returned text:

making time to reflect and review your -> making time to reect and review your
If you find it easier -> If you nd it easier

PdfBox-Android version: [e.g. 2.0.27.0]
It happens on every version of the Android SDK that I have tried (several).

Problematic.pdf

jamiehiggins added the "type: bug" label Jun 27, 2023
THausherr commented

This is an unsolved problem
https://issues.apache.org/jira/browse/PDFBOX-3248

In this file, the /ToUnicode CMap maps the ligature glyphs (the "fl" in "reflect" and the "fi" in "find") to 0, and the content stream relies on the /ActualText feature to supply the real text, which PDFBox doesn't support.
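
To illustrate why the output reads "reect" and "nd", here is a minimal toy model in plain Java. It is not PDFBox code: the glyph codes, the `LigatureDemo` class, and the assumption that a glyph mapped to U+0000 contributes no text are all hypothetical, chosen only to mimic the symptom described above.

```java
import java.util.HashMap;
import java.util.Map;

public class LigatureDemo {
    // Hypothetical glyph codes for the "fl" and "fi" ligatures in a toy font.
    static final int GLYPH_FL = 1;
    static final int GLYPH_FI = 2;

    /** Builds a toy code-to-text table; flText and fiText are what the ligature codes map to. */
    static Map<Integer, String> toyToUnicode(String flText, String fiText) {
        Map<Integer, String> map = new HashMap<>();
        for (char c = 'a'; c <= 'z'; c++) {
            map.put((int) c, String.valueOf(c));
        }
        map.put((int) ' ', " ");
        map.put(GLYPH_FL, flText);
        map.put(GLYPH_FI, fiText);
        return map;
    }

    /** Decodes glyph codes to text, skipping codes that are unmapped or mapped to U+0000. */
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String text = toUnicode.get(code);
            if (text == null || text.equals("\u0000")) {
                continue; // assumption: an unusable mapping contributes no text
            }
            out.append(text);
        }
        return out.toString();
    }

    /** Glyph codes for the word "reflect" when the font uses an "fl" ligature. */
    static int[] reflectCodes() {
        return new int[] {'r', 'e', GLYPH_FL, 'e', 'c', 't'};
    }

    public static void main(String[] args) {
        // Ligatures mapped to 0, as in the problematic file: prints "reect".
        System.out.println(decode(reflectCodes(), toyToUnicode("\u0000", "\u0000")));
        // Ligatures mapped to their letters: prints "reflect".
        System.out.println(decode(reflectCodes(), toyToUnicode("fl", "fi")));
    }
}
```

In the toy model the fix is a better mapping table; in the real file the correct text is only available through /ActualText, so a table lookup alone cannot recover it.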
