Description
I want to read this PDF file, but PyPDF2 hangs indefinitely, spinning at 100% CPU, while reading it.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3
Code + PDF
This is a minimal, complete example that shows the issue:
import PyPDF2

with open("The lean times in the Peruvian economy.pdf", "rb") as f:
    pdfreader = PyPDF2.PdfFileReader(f, strict=True)
    npage = 0
    for page in pdfreader.pages:
        npage += 1
        print(f"Reading page {npage} of {pdfreader.numPages}")
        a = page.extractText()
PDF used above: The lean times in the Peruvian economy.pdf
Output of the script
Reading page 1 of 19
Reading page 2 of 19
Reading page 3 of 19
Reading page 4 of 19
Reading page 5 of 19
Reading page 6 of 19
Reading page 7 of 19
Reading page 8 of 19
Reading page 9 of 19
Reading page 10 of 19
Reading page 11 of 19
Reading page 12 of 19
Reading page 13 of 19
Reading page 14 of 19
Reading page 15 of 19
Reading page 16 of 19
At this point, the script spins at 100% CPU. I terminated it manually after more than half an hour.
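A hang like this can be bounded, and its location captured, with the standard library's `faulthandler` module. The following is a minimal sketch, not part of the original report: the helper name `call_with_stack_dump` and the timeout values are assumptions made for illustration. If the wrapped call has not returned within the timeout, the tracebacks of all threads are written to the given file and the process exits.

```python
import faulthandler
import sys


def call_with_stack_dump(func, timeout_s, log_file=sys.stderr, *args, **kwargs):
    """Run func(*args, **kwargs) under a watchdog (hypothetical helper).

    If the call is still running after timeout_s seconds, every thread's
    traceback is written to log_file and the process is terminated.
    log_file must be a real file object that provides fileno().
    """
    faulthandler.dump_traceback_later(timeout_s, file=log_file, exit=True)
    try:
        return func(*args, **kwargs)
    finally:
        # The call returned (or raised) in time, so disarm the watchdog.
        faulthandler.cancel_dump_traceback_later()
```

In the script above, the extraction call could then be written as `a = call_with_stack_dump(page.extractText, 60)`, so that a stuck page produces a traceback pointing at the spinning loop instead of running until it is killed by hand.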
Preliminary code analysis
The code is spinning in this loop:
https://github.com/py-pdf/PyPDF2/blob/84460f54aa4721db36452fe510f8063838e358d5/PyPDF2/_cmap.py#L273-L282
with a very large value of b = 438093348969. After roughly one minute, a had grown by 9430662, suggesting this loop would run for more than 32 days. For any other page in this PDF, b never exceeds 0xFFFD, which would make this loop finish in about 0.4s.
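The runtime estimates above can be reproduced with a few lines of arithmetic. This is only a sketch under two assumptions: the loop advances a by one per iteration at a constant rate, and a starts far below b. The function name `estimated_seconds` is made up for illustration.

```python
# Figures taken from the observations above.
B_STUCK = 438093348969       # loop bound on the page that hangs
B_NORMAL = 0xFFFD            # largest bound seen on the other pages
STEPS_PER_MINUTE = 9430662   # observed growth of `a` in roughly one minute


def estimated_seconds(bound, steps_per_minute=STEPS_PER_MINUTE):
    """Seconds needed to count up to `bound` at a constant step rate."""
    return bound / steps_per_minute * 60


stuck_days = estimated_seconds(B_STUCK) / 86400  # 86400 seconds per day
normal_seconds = estimated_seconds(B_NORMAL)

print(f"page that hangs: about {stuck_days:.1f} days")
print(f"other pages: about {normal_seconds:.2f} s")
```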
The lack of comments and the unclear variable names keep me from debugging this any further, but hopefully this gives the maintainers a hint about where to look.