Can't get text from the pdf #1465

hemilparmar · 2022-12-01T14:30:56Z

When I parse these PDFs, I cannot get their text content:
test.pdf
test2.pdf

PyPDF2 version: 2.11.2

Code:

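import PyPDF2

# PdfReader, .pages and extract_text() replace the deprecated
# PdfFileReader, getPage() and extractText() names in PyPDF2 2.x.
with open('test.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    page = reader.pages[1]  # index 1 is the second page (0-based)
    print(page.extract_text())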

Answered by MartinThoma

MartinThoma · 2022-12-01T15:16:54Z

Maintainer

From https://pypdf2.readthedocs.io/en/latest/user/extract-text.html

Sometimes PDFs do not contain the text as it’s displayed, but instead an image. You notice that when you cannot copy the text. Then there are PDF files that contain an image and a text layer in the background. That typically happens when a document was scanned. Although the scanning software (OCR) is pretty good today, it still fails once in a while. PyPDF2 is no OCR software; it will not be able to detect those failures. PyPDF2 will also never be able to extract text from images.
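A quick way to check whether a file falls into the image-only case is to see which pages yield no text at all. The following is a minimal sketch, not part of the original answer: the helper names `pages_without_text` and `extract_page_texts` are made up for illustration, and it assumes PyPDF2 2.x is installed.

```python
from typing import List


def pages_without_text(page_texts: List[str]) -> List[int]:
    """Return the 0-based indices of pages whose extracted text is blank."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]


def extract_page_texts(path: str) -> List[str]:
    """Extract the text of every page with PyPDF2 (hypothetical helper)."""
    # Imported lazily so pages_without_text can be used without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # Guard against a None result so callers always get strings.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling `pages_without_text(extract_page_texts('test.pdf'))` lists the pages where nothing could be extracted. A blank result is only a hint that the page may be a scanned image; confirming it would need a look at the file itself, and reading such pages would need separate OCR software.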

2 replies

hemilparmar Dec 1, 2022
Author

Yes, but can you look at test2.pdf? In that PDF the text is there, but extraction returns random numbers, as if the text were encrypted.

pubpub-zz Dec 1, 2022
Maintainer

Your PDF has been deliberately generated to prevent text extraction.
Have a look at #1171 for more details.
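For files like this, it can help to flag suspicious output automatically instead of reading it by eye. The sketch below is a rough heuristic only, assuming that garbled output consists mostly of digits and symbols as described above. The function name `classify_extracted_text` and the 0.5 threshold are illustrative choices, not part of PyPDF2.

```python
def classify_extracted_text(text: str, min_letter_ratio: float = 0.5) -> str:
    """Roughly classify a string returned by page.extract_text().

    Returns "empty", "garbled" or "text". The threshold is an assumption
    and will misjudge pages that legitimately contain mostly numbers.
    """
    # Ignore all whitespace so layout does not affect the ratio.
    compact = "".join(text.split())
    if not compact:
        return "empty"
    letters = sum(ch.isalpha() for ch in compact)
    if letters / len(compact) < min_letter_ratio:
        return "garbled"
    return "text"
```

A "garbled" result does not prove that extraction was blocked on purpose; it only marks pages worth checking manually.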
