KeyError: b'1' in read_string_from_stream #1294

@DL6ER

See #1269 for further details. This reports another issue I've come across.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

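import PyPDF2

# Open the problematic PDF in non-strict mode and extract the text of every page.
with open("AN EXACT ANALYTICAL SOLUTION OF KEPLER'S EQUATION.pdf", "rb") as f:
    pdfreader = PyPDF2.PdfFileReader(f, strict=False)
    # extractText() is the call that raises the ValueError in the traceback below.
    full_content = " ".join(page.extractText() for page in pdfreader.pages)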

PDF used above: AN EXACT ANALYTICAL SOLUTION OF KEPLER'S EQUATION.pdf

Another example triggering the exact same issue: How to time-stamp a digital document-88624977e595442fde2b0bd4fe3a0fae.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_utils.py", line 80, in read_string_from_stream
    tok = escape_dict[tok]
KeyError: b'1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1538, in extractText
    return self.extract_text()
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1510, in extract_text
    return self._extract_text(
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_page.py", line 1157, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 689, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 719, in __parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_data_structures.py", line 830, in read_object
    return read_string_from_stream(stream, forced_encoding)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/generic/_utils.py", line 95, in read_string_from_stream
    tok = b_(chr(int(tok, base=8)))
ValueError: invalid literal for int() with base 8: b'169'

The PDF can be read using a normal PDF viewer.
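
Judging from the traceback, the failing token is `b'169'`, which is not a valid octal number because `9` is not an octal digit. The PDF specification allows `\ddd` escapes in literal strings with one to three octal digits, so `\169` would presumably need to be read as `\16` followed by a literal `9`. Below is a minimal sketch of such a tolerant reader. `read_octal_escape` is a hypothetical helper written for illustration, not PyPDF2's actual code, and the sample stream contents are made up:

```python
import io

OCTAL_DIGITS = b"01234567"


def read_octal_escape(first_digit: bytes, stream: io.BytesIO) -> bytes:
    """Hypothetical helper: decode a PDF octal escape of one to three digits.

    `first_digit` is the digit already consumed after the backslash.
    """
    digits = first_digit
    while len(digits) < 3:
        nxt = stream.read(1)
        if nxt and nxt in OCTAL_DIGITS:
            digits += nxt
        else:
            # Not an octal digit: push it back so it is read as ordinary text.
            if nxt:
                stream.seek(-1, io.SEEK_CUR)
            break
    # High-order overflow is ignored, so keep only the low 8 bits.
    return bytes([int(digits, 8) & 0xFF])


# Reproduce the ValueError from the traceback in isolation.
try:
    int(b"169", base=8)
except ValueError as exc:
    print(exc)

# Tolerant reading: "\16" is decoded, "9" stays in the stream as normal text.
stream = io.BytesIO(b"69) Tj")
decoded = read_octal_escape(b"1", stream)
print(decoded, stream.read(1))
```

Reading digit by digit and stopping at the first non-octal character avoids passing something like `b'169'` to `int(..., base=8)` at all. Whether this matches what other PDF viewers do with these two files is an assumption I have not verified.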

Labels: key-error (Could be a bug, but also a robustness issue)