PyPDF2 forever spinning at 100% CPU #1285

@DL6ER

I want to read this PDF file, but PyPDF2 hangs indefinitely at 100% CPU while reading it.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2

with open("The lean times in the Peruvian economy.pdf", "rb") as f:
    pdfreader = PyPDF2.PdfFileReader(f, strict=True)
    for npage, page in enumerate(pdfreader.pages, start=1):
        # Print progress before extracting, so the last line printed
        # identifies the page on which extraction hangs.
        print(f"Reading page {npage} of {pdfreader.numPages}")
        a = page.extractText()

PDF used above: The lean times in the Peruvian economy.pdf

Output of the script

Reading page 1 of 19
Reading page 2 of 19
Reading page 3 of 19
Reading page 4 of 19
Reading page 5 of 19
Reading page 6 of 19
Reading page 7 of 19
Reading page 8 of 19
Reading page 9 of 19
Reading page 10 of 19
Reading page 11 of 19
Reading page 12 of 19
Reading page 13 of 19
Reading page 14 of 19
Reading page 15 of 19
Reading page 16 of 19

At this point, the script spun at 100% CPU for more than half an hour, until I terminated it manually.
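For anyone reproducing this, a time limit around the extraction call avoids having to kill the script by hand. The following is only a sketch, assuming a Unix system and that the call runs in the main thread; `time_limit` and `spin_forever` are hypothetical names introduced here and are not part of PyPDF2. It relies on the hang being a pure-Python loop, which a signal handler can interrupt.

```python
import signal
from contextlib import contextmanager


@contextmanager
def time_limit(seconds):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`.

    Hypothetical helper; Unix only, main thread only (uses SIGALRM).
    """
    def _on_alarm(signum, frame):
        raise TimeoutError(f"block exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def spin_forever():
    # Stand-in for the hanging extraction call: a pure-Python busy loop.
    n = 0
    while True:
        n += 1


try:
    with time_limit(0.2):
        spin_forever()
    timed_out = False
except TimeoutError:
    timed_out = True

print("timed out:", timed_out)
```

In the reproduction script, the call inside the `with time_limit(...)` block would be `page.extractText()` instead of `spin_forever()`, with a limit of a few seconds per page.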

Preliminary code analysis

The code is spinning in this loop:
https://github.com/py-pdf/PyPDF2/blob/84460f54aa4721db36452fe510f8063838e358d5/PyPDF2/_cmap.py#L273-L282
with a very large value of b = 438093348969. After roughly one minute, a had grown by 9430662, suggesting this loop would run for more than 32 days. For every other page in this PDF, b never exceeds 0xFFFD, which would let the loop finish in about 0.4 s.
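The runtime estimates follow from simple arithmetic on the observed numbers. The sketch below assumes that a starts near zero, advances by one per iteration, and keeps the rate measured during the first minute; the variable names other than `b` are introduced here for illustration.

```python
b = 438093348969             # loop bound seen on the hanging page
steps_per_minute = 9430662   # growth of a observed in roughly one minute

# Time to reach b at the observed rate, in days.
minutes = b / steps_per_minute
days = minutes / (60 * 24)

# Largest bound seen on the other pages, and its runtime in seconds.
normal_b = 0xFFFD
normal_seconds = normal_b / steps_per_minute * 60

print(f"hanging page: about {days:.1f} days")
print(f"other pages:  about {normal_seconds:.2f} s")
```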

The lack of comments and the non-descriptive variable names prevent any further debugging on my side, but hopefully this gives the maintainers a hint about where to look.

Labels: is-bug, nf-performance, nf-security, workflow-text-extraction
