Expected object ID (5 0) does not match actual (4 0); xref table not zero-indexed #672

ghost · 2022-04-08T09:39:53Z

There are two cases producing this error:

First Case

from PyPDF2 import PdfFileReader, PdfFileMerger, __version__ as ver

if __name__ == '__main__':
    print( f'PyPDF2.__version__ = {ver}' )

    # 1st error
    pdf_merger = PdfFileMerger()
    pdf_merger.append( 'blah.pdf' )  # <-----

causing

"C:\Program Files\Python310\python.exe" C:/Users/thomas/Documents/vonKellerPC/Softwareentwicklung/tkPraxisDj/pypdf2error.py
PyPDF2.__version__ = 1.27.1
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1805]
Traceback (most recent call last):
  File "C:\Users\thomas\Documents\vonKellerPC\Softwareentwicklung\tkPraxisDj\pypdf2error.py", line 8, in <module>
    pdf_merger.append( 'blah.pdf' ) # crash
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\merger.py", line 139, in merge
    pages = (0, pdfr.getNumPages())
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1223, in getNumPages
    self._flatten()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1573, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 177, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1671, in getObject
    raise utils.PdfReadError("Expected object ID (%d %d) does not match actual (%d %d); xref table not zero-indexed." \
PyPDF2.utils.PdfReadError: Expected object ID (5 0) does not match actual (4 0); xref table not zero-indexed.

Process finished with exit code 1

Second Case

from PyPDF2 import PdfFileReader, PdfFileMerger, __version__ as ver

if __name__ == '__main__':
    print( f'PyPDF2.__version__ = {ver}' )

    # 2nd error
    file_in = open('blah.pdf', 'rb')
    pdf_reader = PdfFileReader(file_in)
    metadata = pdf_reader.getDocumentInfo()   # <-----

causing

"C:\Program Files\Python310\python.exe" C:/Users/thomas/Documents/vonKellerPC/Softwareentwicklung/tkPraxisDj/pypdf2error.py
PyPDF2.__version__ = 1.27.1
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1805]
Traceback (most recent call last):
  File "C:\Users\thomas\Documents\vonKellerPC\Softwareentwicklung\tkPraxisDj\pypdf2error.py", line 9, in <module>
    metadata = pdf_reader.getDocumentInfo() # crash
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1169, in getDocumentInfo
    obj = self.trailer['/Info']
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\generic.py", line 177, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Program Files\Python310\lib\site-packages\PyPDF2\pdf.py", line 1671, in getObject
    raise utils.PdfReadError("Expected object ID (%d %d) does not match actual (%d %d); xref table not zero-indexed." \
PyPDF2.utils.PdfReadError: Expected object ID (6 0) does not match actual (5 0); xref table not zero-indexed.

Process finished with exit code 1

Both cases use the PDF file "blah.pdf", which was created by the scanner of an HP Officejet 8010 printer: blah.pdf

My Python version is 3.10.1. I work on Windows 10. The errors happen in a PyCharm Community environment.
For a similar error, please refer to issue #566 from July 2020, which was closed yesterday.


johns1c · 2022-04-09T10:29:35Z

I strongly suspect that the PDF itself is in error. PDF-XChange Editor reports "Error Loading object - misaligned object".
Do all documents scanned with your printer have this error? I have an HP 6520, and its scans are unreadable in some circumstances. They always show minor discrepancies between the length of the image streams given by /Length and the position of the end-of-stream marker, which could conceivably affect the entire object structure.
I suggest that we add a tag to this issue indicating that it relates to the handling of a bad PDF.
I suggest that the issue be closed, as the PyPDF2 error reporting seems reasonable.
I will take a look at the internal structure using Didier Stevens' pdf-parser today if I have time.

johns1c · 2022-04-09T11:02:58Z

The xref table seems to be present but empty. This does not agree with the PDF specification: section 7.5.4 says "the table shall contain", and as far as I can see the document does not contain a "cross-reference stream" as per section 7.5.8.

johns1c · 2022-04-09T23:43:14Z

I had a second look at this and find that the xref table

- is present. I was wrong earlier: the tool I was using failed to print its contents.
- is incorrect. The line immediately below the `xref` keyword should be `0 7`, not `1 7`.
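To make the `0 7` versus `1 7` point concrete, here is a minimal sketch of how such an off-by-one subsection header could be spotted. It is not PyPDF2's implementation. The function names and the sample table are hypothetical, and the byte offsets are made up rather than taken from blah.pdf. The only assumption about the format is the classic xref table layout: the `xref` keyword, a header line giving the first object number and the entry count, then one entry per object, with object 0 being the free entry of generation 65535.

```python
# Illustrative xref entries (hypothetical offsets, not the contents of blah.pdf).
SAMPLE_ENTRIES = "\n".join([
    "0000000000 65535 f",
    "0000000015 00000 n",
    "0000000074 00000 n",
    "0000000131 00000 n",
    "0000000250 00000 n",
    "0000000412 00000 n",
    "0000000530 00000 n",
])

WELL_FORMED = "xref\n0 7\n" + SAMPLE_ENTRIES
OFF_BY_ONE = "xref\n1 7\n" + SAMPLE_ENTRIES


def parse_subsection_header(xref_text):
    """Return (first_object_number, entry_count, entry_lines) for the first subsection."""
    lines = [line.strip() for line in xref_text.strip().splitlines()]
    if len(lines) < 3 or lines[0] != "xref":
        raise ValueError("expected 'xref', a header line and at least one entry")
    first_text, count_text = lines[1].split()
    return int(first_text), int(count_text), lines[2:]


def header_looks_off_by_one(xref_text):
    """True if the header starts above 0 although the first entry is the free-list head."""
    first_object, _count, entries = parse_subsection_header(xref_text)
    _offset, generation, kind = entries[0].split()
    starts_with_free_head = generation == "65535" and kind == "f"
    return first_object != 0 and starts_with_free_head


if __name__ == "__main__":
    print("well-formed header suspicious:", header_looks_off_by_one(WELL_FORMED))
    print("scanner-style header suspicious:", header_looks_off_by_one(OFF_BY_ONE))
```

With a header of `1 7`, every entry is attributed to an object number one higher than the object actually stored at that offset, which matches the "Expected object ID (5 0) does not match actual (4 0)" message above.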

`PyPDF2.PdfFileReader` will handle the file if `strict=False` is used.

`PyPDF2.PdfFileMerger` also takes a `strict` parameter in its constructor, and the first test case also passes if `strict=False` is added to the line `pdf_merger = PdfFileMerger()`.
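For reference, the two reproduction cases from the original report could be rewritten along these lines. This is a sketch assuming the PyPDF2 1.27.x API used in this issue (`PdfFileReader`, `PdfFileMerger`, `getDocumentInfo`). The helper names `read_metadata_lenient` and `merge_lenient` are hypothetical, and the imports are placed inside the functions only so that the snippet can be loaded without PyPDF2 installed.

```python
def read_metadata_lenient(path):
    """Second case: read the document info with strict=False."""
    from PyPDF2 import PdfFileReader  # PyPDF2 1.27.x API (assumption)

    with open(path, "rb") as file_in:
        reader = PdfFileReader(file_in, strict=False)
        info = reader.getDocumentInfo()
        if info is None:
            return None
        # Copy the values while the file is still open, so nothing has to be
        # resolved from the stream after it is closed.
        return {key: info[key] for key in info}


def merge_lenient(input_paths, output_path):
    """First case: append files with a merger created with strict=False."""
    from PyPDF2 import PdfFileMerger  # PyPDF2 1.27.x API (assumption)

    merger = PdfFileMerger(strict=False)
    try:
        for path in input_paths:
            merger.append(path)
        with open(output_path, "wb") as file_out:
            merger.write(file_out)
    finally:
        merger.close()
```

Usage would look like `read_metadata_lenient('blah.pdf')` or `merge_lenient(['blah.pdf'], 'merged.pdf')`, where `merged.pdf` is a placeholder output name.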

johns1c · 2022-04-09T23:56:56Z

NO FAULT - PyPDF2 DOES handle this type of badly formed PDF if `strict=False` is used

Tests carried out today using the sample file show:

- PyPDF2 correctly reports the faulty input in its default setting. The source PDF is badly formed.
- PyPDF2 correctly handles the faulty input if `strict=False` is used; writing the file out again will fix the erroneous input.

MartinThoma · 2022-04-19T17:09:23Z

I just confirmed this, thank you @johns1c. I'm closing the issue as everything seems right. However, I am thinking about adding `strict=False` to many or all of the examples in the docs 🤔

MartinThoma added the is-bug and Has MCVE labels on Apr 8, 2022

MartinThoma added the is-robustness-issue label and removed the is-bug label on Apr 19, 2022

MartinThoma closed this as completed Apr 19, 2022
