Can't get text from the pdf #1465

Answered by MartinThoma
hemilparmar asked this question in Q&A
From https://pypdf2.readthedocs.io/en/latest/user/extract-text.html

Sometimes PDFs do not contain the text as it’s displayed, but instead an image. You notice that when you cannot copy the text. Then there are PDF files that contain an image and a text layer in the background. That typically happens when a document was scanned. Although the scanning software (OCR) is pretty good today, it still fails once in a while. PyPDF2 is no OCR software; it will not be able to detect those failures. PyPDF2 will also never be able to extract text from images.
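One rough way to check whether a file falls into the image-only case is to extract the text page by page and see which pages come back empty. The sketch below assumes PyPDF2 2.0 or later (`PdfReader`, `reader.pages`, `page.extract_text()`); the helper names `pages_without_text` and `extract_page_texts` and the `min_chars` threshold are made up for illustration. An empty result only suggests that a page has no text layer. It does not prove the page is a scanned image, and it says nothing about whether an existing OCR text layer is correct.

```python
from typing import Iterable, List, Optional


def pages_without_text(
    page_texts: Iterable[Optional[str]], min_chars: int = 1
) -> List[int]:
    """Return 0-based indices of pages whose extracted text is (nearly) empty.

    A page counts as empty when, after stripping whitespace, it has fewer
    than `min_chars` characters. `min_chars` is an illustrative knob, not
    something PyPDF2 defines.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(path: str) -> List[str]:
    """Extract the text of every page with PyPDF2.

    The import is done here so that pages_without_text() can be used
    without PyPDF2 installed.
    """
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    reader = PdfReader(path)
    # extract_text() only reads the text layer; it never runs OCR on images.
    return [page.extract_text() or "" for page in reader.pages]
```

For example, `pages_without_text(extract_page_texts("scan.pdf"))` (with `scan.pdf` standing in for your own file) lists the pages that returned no text. Those pages would need a separate OCR tool, since PyPDF2 cannot read text out of images.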

Answer selected by MartinThoma