Recursion error when using clone_from of PdfWriter on PDF 2.0 specification #2839

Closed
stefan6419846 opened this issue Sep 8, 2024 · 4 comments · Fixed by #2865
Labels
PdfWriter The PdfWriter component is affected

Comments

@stefan6419846
Collaborator

stefan6419846 commented Sep 8, 2024

Environment

$ python -m platform
Linux-6.8.0-100039-tuxedo-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.3.0

The version effectively is the latest main code.

Code + PDF

This is a minimal, complete example that shows the issue:

>>> from pypdf import PdfWriter
>>> writer = PdfWriter(clone_from='ISO_32000-2-2020_sponsored.pdf')

Using PdfReader and iterating over the pages extracting the text does not fail.

I cannot share the document (1003 pages) here, as it is the non-public copy of the PDF 2.0 specification, which is available for free at https://pdfa.org/sponsored-standards/

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 233, in __init__
    self.clone_document_from_reader(clone_from)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 1150, in clone_document_from_reader
    self.clone_reader_document_root(reader)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_writer.py", line 1119, in clone_reader_document_root
    self._root_object = reader.root_object.clone(self)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
[...]
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 129, in clone
    arr.append(data.clone(pdf_dest, force_duplicate, ignore_fields))
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 129, in clone
    arr.append(data.clone(pdf_dest, force_duplicate, ignore_fields))
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 274, in clone
    obj.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 258, in clone
    d__._clone(self, pdf_dest, force_duplicate, ignore_fields, visited)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 369, in _clone
    v.clone(pdf_dest, force_duplicate, ignore_fields)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 129, in clone
    arr.append(data.clone(pdf_dest, force_duplicate, ignore_fields))
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 266, in clone
    obj = self.get_object()
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 286, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_reader.py", line 381, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_reader.py", line 315, in _get_object_from_stream
    obj_stm: EncodedStreamObject = IndirectObject(stmnum, 0, self).get_object()  # type: ignore
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_base.py", line 286, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/_reader.py", line 442, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 1305, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/home/stefan/tmp/pypdf/pypdf_upstream/pypdf/generic/_data_structures.py", line 562, in read_from_stream
    if isinstance(length, IndirectObject):
  File "/usr/lib/python3.10/typing.py", line 1503, in __instancecheck__
    issubclass(instance.__class__, cls)):
RecursionError: maximum recursion depth exceeded
stefan6419846 added the PdfWriter label (The PdfWriter component is affected) on Sep 8, 2024
@pubpub-zz
Collaborator

I see the same behavior on Windows with Python 3.10.5.

However, after upgrading to Python 3.13 (standard), the file can be loaded successfully when the recursion limit is set to 5000 (on Python 3.10, there is a "crash" with a stack overflow).
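For reference, the recursion-limit workaround mentioned above could look roughly like the sketch below. The helper names `raised_recursion_limit` and `clone_with_raised_limit` are hypothetical and not part of pypdf; only `sys.getrecursionlimit`, `sys.setrecursionlimit`, and `PdfWriter(clone_from=...)` are existing calls. The value 5000 is simply the one reported in this comment, and as noted, a higher limit may end in a stack overflow instead of a clean `RecursionError` on some Python versions.

```python
import sys
from contextlib import contextmanager


@contextmanager
def raised_recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit (hypothetical helper)."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the original limit once the deep operation is done.
        sys.setrecursionlimit(previous)


def clone_with_raised_limit(path, limit=5000):
    """Clone a PDF into a PdfWriter while the higher limit is active."""
    # Imported lazily so the context manager above works without pypdf installed.
    from pypdf import PdfWriter

    with raised_recursion_limit(limit):
        return PdfWriter(clone_from=path)
```

Restoring the previous limit afterwards keeps the rest of the program under the default safety net, so only the cloning step runs with the deeper stack.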

@pubpub-zz
Collaborator

@stefan6419846
I propose closing this and converting it into a discussion for history.

@stefan6419846
Collaborator Author

I do not think we should convert this into a discussion, as this surely is some bug/limitation. Is there any reason why this would not fail for the reader, but for the writer? In any case, I recommend documenting the reason for this inside our docs and propose possible workarounds, like increasing the recursion limit (with an example) or splitting large documents beforehand.

@pubpub-zz
Collaborator

> I do not think we should convert this into a discussion, as this surely is some bug/limitation. Is there any reason why this would not fail for the reader, but for the writer?

Yes: the reader only reads, loads, and caches objects into memory when they are required. In the current design, PdfWriter pulls in (clones) the root object and all linked objects recursively.

> In any case, I recommend documenting the reason for this inside our docs and propose possible workarounds, like increasing the recursion limit (with an example) or splitting large documents beforehand.

Then I would propose adding this to the documentation:
"When cloning or merging a document, a recursion error may occur. You can try increasing the recursion limit. You may also try a newer Python version."
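The reader/writer difference described above can be illustrated with a toy model. This is not pypdf code: `make_chain`, `clone_recursive`, `clone_iterative`, and `chain_length` are hypothetical names, and the linked dicts only stand in for PDF objects that reference each other. The sketch shows that a clone using one Python call per link fails on a chain deeper than the recursion limit, while merely holding the chain (or walking it in a loop) does not.

```python
import sys


def make_chain(length):
    """Build a chain of dicts, each pointing to the next via the "next" key."""
    head = None
    for _ in range(length):
        head = {"next": head}
    return head


def clone_recursive(node):
    """Clone with one Python call per link, so deep chains exhaust the stack."""
    if node is None:
        return None
    return {"next": clone_recursive(node["next"])}


def clone_iterative(node):
    """Produce the same copy by walking the chain in a loop instead."""
    if node is None:
        return None
    head = tail = {"next": None}
    node = node["next"]
    while node is not None:
        tail["next"] = {"next": None}
        tail = tail["next"]
        node = node["next"]
    return head


def chain_length(node):
    """Count the links without recursion."""
    count = 0
    while node is not None:
        count += 1
        node = node["next"]
    return count


# A chain twice as deep as the current recursion limit.
chain = make_chain(sys.getrecursionlimit() * 2)

try:
    clone_recursive(chain)
    recursive_ok = True
except RecursionError:
    recursive_ok = False

iterative_copy = clone_iterative(chain)
```

Whether an iterative approach would be practical inside pypdf's clone logic is not something this thread establishes; the sketch only shows why depth, not document size as such, triggers the error.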

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 22, 2024
stefan6419846 pushed a commit that referenced this issue Sep 25, 2024