Expected object ID (5 0) does not match actual (4 0); xref table not zero-indexed #672

Closed
ghost opened this issue Apr 8, 2022 · 5 comments
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-robustness-issue From a users perspective, this is about robustness

Comments


ghost commented Apr 8, 2022

There are two cases producing this error:

First Case

from PyPDF2 import PdfFileReader, PdfFileMerger, __version__ as ver

if __name__ == '__main__':
    print( f'PyPDF2.__version__ = {ver}' )

    # 1st error
    pdf_merger = PdfFileMerger()
    pdf_merger.append( 'blah.pdf' )  # <-----

causing

"C:\Program Files\Python310\python.exe" C:/Users/thomas/Documents/vonKellerPC/Softwareentwicklung/tkPraxisDj/pypdf2error.py
PyPDF2.__version__ = 1.27.1
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1805]
Traceback (most recent call last):
  File "C:\Users\thomas\Documents\vonKellerPC\Softwareentwicklung\tkPraxisDj\pypdf2error.py", line 8, in <module>
    pdf_merger.append( 'blah.pdf' ) # crash
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\merger.py", line 139, in merge
    pages = (0, pdfr.getNumPages())
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1223, in getNumPages
    self._flatten()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1573, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 177, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1671, in getObject
    raise utils.PdfReadError("Expected object ID (%d %d) does not match actual (%d %d); xref table not zero-indexed." \
PyPDF2.utils.PdfReadError: Expected object ID (5 0) does not match actual (4 0); xref table not zero-indexed.

Process finished with exit code 1

Second Case

from PyPDF2 import PdfFileReader, PdfFileMerger, __version__ as ver

if __name__ == '__main__':
    print( f'PyPDF2.__version__ = {ver}' )

    # 2nd error
    file_in = open('blah.pdf', 'rb')
    pdf_reader = PdfFileReader(file_in)
    metadata = pdf_reader.getDocumentInfo()   # <-----

causing

"C:\Program Files\Python310\python.exe" C:/Users/thomas/Documents/vonKellerPC/Softwareentwicklung/tkPraxisDj/pypdf2error.py
PyPDF2.__version__ = 1.27.1
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1805]
Traceback (most recent call last):
  File "C:\Users\thomas\Documents\vonKellerPC\Softwareentwicklung\tkPraxisDj\pypdf2error.py", line 9, in <module>
    metadata = pdf_reader.getDocumentInfo() # crash
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1169, in getDocumentInfo
    obj = self.trailer['/Info']
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 177, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1671, in getObject
    raise utils.PdfReadError("Expected object ID (%d %d) does not match actual (%d %d); xref table not zero-indexed." \
PyPDF2.utils.PdfReadError: Expected object ID (6 0) does not match actual (5 0); xref table not zero-indexed.

Process finished with exit code 1

Both errors occur with the PDF file "blah.pdf", created by the scanner of an HP Officejet 8010 printer: blah.pdf

My Python version is 3.10.1. I work on Windows 10. The errors happen in a PyCharm Community environment.
For a similar error, please refer to issue #566 from July 2020, which was closed yesterday.

@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests labels Apr 8, 2022

johns1c commented Apr 9, 2022

  1. I strongly suspect that the PDF itself is faulty. PDF-XChange Editor reports "Error Loading object - misaligned object".
  2. Do all documents scanned with your printer have this error? I have an HP 6520; its scans are unreadable in some circumstances and always show minor discrepancies between the length of the image streams given by /Length and the position of the endstream marker, which conceivably could affect the entire object structure.
  3. I suggest that we add a tag to this issue indicating that it relates to the handling of a bad PDF.
  4. I suggest the issue be closed, as the PyPDF2 error reporting seems reasonable.
  5. I will take a look at the internal structure using Didier Stevens' pdf-parser today if I have time.


johns1c commented Apr 9, 2022

The xref table seems to be present but empty. This does not agree with the PDF specification: section 7.5.4 says "the table shall contain", and as far as I can see the document does not contain a "cross-reference stream" as per section 7.5.8.


johns1c commented Apr 9, 2022

I had a second look at this and found that the xref table

  1. is present. I was wrong earlier; the tool I was using failed to print the contents.
  2. is incorrect. The line immediately below the xref keyword should be "0 7", not "1 7".
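
To illustrate why a wrong first-object number in the xref subsection header leads to an "expected ID does not match actual" error, here is a minimal standard-library sketch. It is not how PyPDF2 parses files, and it does not read blah.pdf: build_sample and check_xref are hypothetical helpers, and the data is a synthetic three-object stand-in written only for this illustration.

```python
import re


def build_sample(first_id_in_xref: int) -> bytes:
    """Build a tiny PDF-like byte string: objects 1..3 plus a classic xref table.

    first_id_in_xref is the first number on the subsection header line;
    0 is correct for this layout, 1 reproduces the off-by-one header.
    """
    body = b"%PDF-1.4\n"
    offsets = []
    for num in (1, 2, 3):
        offsets.append(len(body))  # byte offset where "N 0 obj" starts
        body += b"%d 0 obj\n<< >>\nendobj\n" % num
    # Entry for object 0 (free), then one in-use entry per real object.
    entries = [b"0000000000 65535 f \n"]
    entries += [b"%010d 00000 n \n" % off for off in offsets]
    header = b"xref\n%d %d\n" % (first_id_in_xref, len(entries))
    return body + header + b"".join(entries)


def check_xref(data: bytes):
    """Return (expected_id, actual_id) pairs where the table and the object header disagree."""
    start = data.rindex(b"xref\n") + len(b"xref\n")
    lines = data[start:].splitlines()
    first_id, count = (int(x) for x in lines[0].split())
    mismatches = []
    for i, entry in enumerate(lines[1:1 + count]):
        offset, _generation, kind = entry.split()
        if kind != b"n":  # skip free entries
            continue
        expected = first_id + i  # ID the table claims lives at this offset
        match = re.match(rb"(\d+) (\d+) obj", data[int(offset):])
        actual = int(match.group(1))  # ID actually written at that offset
        if actual != expected:
            mismatches.append((expected, actual))
    return mismatches


print(check_xref(build_sample(0)))  # header "0 4": table and objects agree
print(check_xref(build_sample(1)))  # header "1 4": every ID is shifted by one
```

With the header starting at 1, each entry is attributed to an ID one higher than the object stored at that offset, which is the same pattern as "Expected object ID (5 0) does not match actual (4 0)" in the tracebacks above.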

PyPDF2.PdfFileReader will handle the file if strict=False is used.

PyPDF2.PdfFileMerger also takes a strict parameter in its constructor, and the first case also passes the test if
strict=False is added to the line pdf_merger = PdfFileMerger()
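
For reference, a minimal sketch of both workarounds, assuming the PyPDF2 1.27.x API used in the report (PdfFileReader, PdfFileMerger, getNumPages). The function names page_count_lenient and merge_lenient and the file names in the usage comment are hypothetical; PyPDF2 is imported inside the functions only so that the sketch can be loaded without the package installed.

```python
def page_count_lenient(path):
    """Open a PDF with strict=False and return its page count."""
    from PyPDF2 import PdfFileReader  # assumes PyPDF2 1.27.x

    with open(path, "rb") as fh:
        # strict=False asks the reader to tolerate the malformed xref header
        reader = PdfFileReader(fh, strict=False)
        return reader.getNumPages()


def merge_lenient(paths, out_path):
    """Append the given PDFs with a non-strict merger and write the result."""
    from PyPDF2 import PdfFileMerger  # assumes PyPDF2 1.27.x

    merger = PdfFileMerger(strict=False)
    for path in paths:
        merger.append(path)
    with open(out_path, "wb") as out:
        merger.write(out)
    merger.close()


# Hypothetical usage with the file from the report:
# print(page_count_lenient("blah.pdf"))
# merge_lenient(["blah.pdf"], "merged.pdf")
```

The same reader construction should apply to the second case: build PdfFileReader(file_in, strict=False) before calling getDocumentInfo().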


johns1c commented Apr 9, 2022

NO FAULT - PyPDF2 DOES handle this type of badly formed PDF if strict=False is used

Tests carried out using the sample file today show:

  1. PyPDF2 correctly reports the faulty input in its default setting. The source PDF is badly formed.
  2. PyPDF2 correctly handles the faulty input if strict=False is used; writing the file out again will fix the erroneous input.

@MartinThoma MartinThoma added is-robustness-issue From a users perspective, this is about robustness and removed is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF labels Apr 19, 2022
@MartinThoma
Member

I just confirmed - thank you @johns1c. I'm closing this as everything seems right. However, I am thinking about adding strict=False to many / all examples in the docs 🤔
