Using PdfReader causes a crash: KeyError: '/Root' #2841

Closed
kalpeshmantri opened this issue Sep 12, 2024 · 4 comments
Labels
is-robustness-issue From a users perspective, this is about robustness needs-pdf The issue needs a PDF file to show the problem PdfReader The PdfReader component is affected

Comments


kalpeshmantri commented Sep 12, 2024

What happened? What were you trying to achieve?

Environment

Which environment were you using when you encountered the problem?

C:\>python -m platform
Windows-11-10.0.22631-SP0

C:\>python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '42.0.7'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader(r"C:\samples\a.pdf")
number_of_pages = len(reader.pages)   #<--- crash
page = reader.pages[0]
text = page.extract_text()

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

(I have hundreds of PDFs with this issue.) One sample PDF is inside the zip:
---- file deleted for security reasons -----

I believe the crash is due to a malformed PDF structure, for example a missing xref table.
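
One rough way to check that suspicion before handing a file to a parser is to look for the basic structural keywords in the raw bytes. The sketch below uses only the standard library; the function name `pdf_structure_markers` and the 2048-byte tail window are assumptions made for illustration, not part of pypdf. It is a heuristic: finding a keyword does not prove the structure is valid.

```python
from pathlib import Path


def pdf_structure_markers(data: bytes) -> dict:
    """Report which basic PDF structural keywords appear in the raw bytes.

    Hypothetical helper for triage only; it does not parse the file.
    """
    # Drop "startxref" first so it is not mistaken for a classic "xref" table.
    without_startxref = data.replace(b"startxref", b"")
    return {
        # The header should appear near the start of the file.
        "header": b"%PDF-" in data[:1024],
        # Either a classic xref table or a cross-reference stream (/Type /XRef).
        "xref": b"xref" in without_startxref or b"/XRef" in data,
        # Pointer to the last cross-reference section.
        "startxref": b"startxref" in data,
        # The trailer entry whose absence produces KeyError: '/Root'.
        "root": b"/Root" in data,
        # End-of-file marker, expected near the end (window size is an assumption).
        "eof": b"%%EOF" in data[-2048:],
    }


def check_file(path: str) -> dict:
    """Convenience wrapper that reads the file and runs the marker check."""
    return pdf_structure_markers(Path(path).read_bytes())
```

A file where `root` or `xref` comes back `False` is a likely candidate for the failure shown in the traceback below.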

Traceback

This is the complete traceback I see:

incorrect startxref pointer(1)
Traceback (most recent call last):
  File "c:\python\Lib\runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\python\Lib\runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "c:\Users\.vscode\extensions\ms-python.debugpy-2024.10.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy\__main__.py", line 39, in <module>
    cli.main()
  File "c:\Users\.vscode\extensions\ms-python.debugpy-2024.10.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\Users\.vscode\extensions\ms-python.debugpy-2024.10.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "c:\Users\.vscode\extensions\ms-python.debugpy-2024.10.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\.vscode\extensions\ms-python.debugpy-2024.10.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "c:\Users\.vscode\extensions\ms-python.debugpy-2024.10.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "C:\samples\pypdf_test.py", line 4, in <module>
    number_of_pages = len(reader.pages)
                      ^^^^^^^^^^^^^^^^^
  File "c:\python\Lib\site-packages\pypdf\_page.py", line 2208, in __len__
    return self.length_function()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\python\Lib\site-packages\pypdf\_doc_common.py", line 353, in get_num_pages
    self._flatten()
  File "c:\python\Lib\site-packages\pypdf\_doc_common.py", line 1098, in _flatten
    catalog = self.root_object
              ^^^^^^^^^^^^^^^^
  File "c:\python\Lib\site-packages\pypdf\_reader.py", line 159, in root_object
    return cast(DictionaryObject, self.trailer[TK.ROOT].get_object())
                                  ~~~~~~~~~~~~^^^^^^^^^
  File "c:\python\Lib\site-packages\pypdf\generic\_data_structures.py", line 409, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/Root'
@kalpeshmantri kalpeshmantri changed the title Using PdfReader causes a crash Using PdfReader causes a crash: KeyError: '/Root' Sep 12, 2024
@pubpub-zz
Collaborator

Your file is reported as a threat (Trojan:PDF/Phish!MSR).
It cannot be analysed, and I have deleted it for security reasons.
If you have a non-dangerous file, please provide it.

@stefan6419846 stefan6419846 added PdfReader The PdfReader component is affected is-robustness-issue From a users perspective, this is about robustness needs-pdf The issue needs a PDF file to show the problem labels Sep 13, 2024
@kalpeshmantri
Author

Here are a couple more PDFs:
error_samples.zip

@pubpub-zz
Collaborator

a.pdf cannot be opened with either pdf.js or Acrobat.
b.pdf is severely damaged: no cross-reference table and no startxref can be found, and other issues exist.

You cannot expect pypdf to handle such damaged files.
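
For anyone processing a large batch that contains files like these, a minimal sketch of a defensive wrapper is shown below. It assumes the goal is only to skip unreadable files and record why they failed. The name `try_count_pages` is hypothetical, and the reader class is passed in as a parameter so the helper itself has no hard dependency; with pypdf installed, one would pass `PdfReader`. Catching `Exception` broadly is a deliberate choice here, since damaged files can fail in many ways (a `KeyError` in this report).

```python
from typing import Callable, Optional, Tuple


def try_count_pages(
    open_reader: Callable[[str], object], path: str
) -> Tuple[Optional[int], Optional[str]]:
    """Return (page_count, None) on success or (None, error_text) on failure.

    Hypothetical helper: `open_reader` is any callable that returns an
    object with a `pages` sequence, e.g. pypdf's PdfReader.
    """
    try:
        reader = open_reader(path)
        # With pypdf this is the step that raised KeyError: '/Root' above.
        return len(reader.pages), None
    except Exception as exc:  # broad on purpose: damaged input fails unpredictably
        return None, f"{type(exc).__name__}: {exc}"


# Example use with pypdf (assumes pypdf is installed):
#   from pypdf import PdfReader
#   count, error = try_count_pages(PdfReader, r"C:\samples\a.pdf")
#   if error is not None:
#       print("skipping damaged file:", error)
```

This does not repair anything; it only keeps one damaged file from stopping the whole run.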

@pubpub-zz
Collaborator

I am closing this issue. A new issue should be created for new cases.
