Closed
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-robustness-issue: From a user's perspective, this is about robustness
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow
Description
See #1269 for further details. This reports another issue I've come across.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3
Code + PDF
This is a minimal, complete example that shows the issue:
import PyPDF2

with open("main.pdf", "rb") as f:
    pdfreader = PyPDF2.PdfFileReader(f, strict=False)
    full_content = " ".join([page.extractText() for page in pdfreader.pages])

PDF used above: main.pdf
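To narrow down which page triggers the failure, extraction can be run page by page with the error captured per page instead of aborting the whole join. The following is a minimal sketch, not part of PyPDF2: `extract_pages_safely` is a hypothetical helper, and it only assumes that each page object exposes an `extract_text()` method, as the traceback shows PyPDF2 2.10.3 pages do.

```python
from typing import Iterable, List, Tuple


def extract_pages_safely(pages: Iterable) -> Tuple[List[str], List[Tuple[int, Exception]]]:
    """Extract text page by page, recording failures instead of aborting.

    Hypothetical helper: assumes each page has an ``extract_text()`` method.
    """
    texts: List[str] = []
    failures: List[Tuple[int, Exception]] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except ValueError as exc:  # the error type seen in the traceback
            failures.append((index, exc))
            texts.append("")  # keep page positions aligned with the PDF
    return texts, failures
```

With the reader from the example above, this would be called as `texts, failures = extract_pages_safely(pdfreader.pages)` inside the `with` block. Each entry in `failures` holds the zero-based page index and the exception raised for that page.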
Traceback
This is the complete Traceback I see:
Traceback (most recent call last):
  File "test3.py", line 5, in <module>
    content = " ".join([page.extractText() for page in pdfreader.pages])
  File "test3.py", line 5, in <listcomp>
    content = " ".join([page.extractText() for page in pdfreader.pages])
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1157, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 689, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 719, in __parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 851, in read_object
    return NumberObject.read_from_stream(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 299, in read_from_stream
    return NumberObject(num)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_base.py", line 274, in __new__
    val = int(value)
ValueError: invalid literal for int() with base 10: b'0,0'
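The last frame shows the root cause: the content stream contains the token `0,0` where a number is expected, and Python's `int()` rejects it. The sketch below reproduces that behaviour in isolation and shows one possible lenient policy, falling back to a default value with a warning. `parse_int_lenient` and its `default` parameter are hypothetical names for illustration; this is not necessarily how PyPDF2 handles or should handle the case.

```python
import logging

logger = logging.getLogger(__name__)


def parse_int_lenient(token: bytes, default: int = 0) -> int:
    """Parse an integer token, falling back to ``default`` if it is malformed.

    Hypothetical helper illustrating a tolerant (non-strict) parsing policy.
    """
    try:
        return int(token)
    except ValueError:
        # b"0,0" ends up here: a comma is not valid in an integer literal.
        logger.warning("Invalid integer token %r; using %d instead", token, default)
        return default
```

The trade-off is that a silently substituted value can shift or drop text in the extracted output, so a policy like this would arguably only fit the `strict=False` mode used in the example.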