Skip to content

Conversation

@ricflams
Copy link
Contributor

@ricflams ricflams commented Sep 26, 2025

On 8000 pdf-files PdfPig failed to read the correct StructTree-object for about 1% of them. The StructTree object was simply missing in the CrossReferenceTable.

It turned out that the constructed CrossReferenceTable could miss Stream-parts if there were multiple Table-parts because a stream will only be added if it's associated with the very first Table-part. The remedy would seem to be to check for and add streams that are associated with any of the Table-parts, not just the first one.
☝️ That was for the old code in CrossReferenceTableBuilder. I'm still including that just FYI. The fix for the refactored xref code is in FirstPassParser.cs instead: associated streams were just not being added.

On a sample of 72 files where this failed, this change fixed the StructTree for all of them.
I've included four files where this was fixed:

doc-00044a646349ce519d9b7287159795cc.pdf
doc-00072bb64c86990e7391ef5baca69099.pdf
doc-000558f2b1d03d09ea6bb359732a8f09.pdf
doc-000518224506a58efb993b67ef57661b.pdf

Obsolete comment:
I fetched latest master (it had been a couple of months) and realized that the code wasn't called anymore because the parsing of xrefs had been completely refactored: 0afe021. The refactored PdfPig-code still can't deal with that situation (for instance the 4 files I've attached) so I'm back to square one. Even though my change in this PR is for a file than are no longer in play I'm submitting it anyway, since it may give you a hint about what's currently wrong and how to fix it. I really appreciate the effort you're putting into PdfPig. Just rather bad timing that the fix I made today for a problem that had been troubling me for months, was in vain because that part has been refactored three weeks ago 🥲

The fix also revealed that one of the github test-case files, ErcotFacts.pdf in Issue874, was not being read correctly before. It actually included more text that was only referred to in the missing stream-xref: here's a diff of the text with/without this PR's fix, how the pdf is rendered in Chrome, and what text Aspose.PDF extracts:

6nMs8tLGbh image

The fix also means that ErcotFacts.pdf is no longer missing a font so that test-case has been changed too.

On a large sample of pdf-files PdfPig failed to read the correct StructTree-object for about 1% of them. The StructTree object was simply missing in the CrossReferenceTable.CrossReferenceTable.
It turned out that the constructed CrossReferenceTable could miss Stream-parts if there were multiple Table-parts because a stream will only be added if it's associated with the very first Table-part. The remedy would seem to be to check for and add streams that are associated with any of the Table-parts, not just the first one.
On a sample of 72 files where this failed, this changed fixed the StructTree for all of them.
@BobLd
Copy link
Collaborator

BobLd commented Sep 28, 2025

@ricflams thanks a lot for sharing the insight, even if outdated now. @EliotJones worked on the refactoring of this part of the code, so ill let him have a look first.

If he does not have time, ill try to find the time to take care of the change. In this context, if you could share files that are failing (the 1%), that'd be very helpful.

That being said, if you i have time to look into your fix in the context of the new code, that'd be amazing. Thanks again!

@ricflams
Copy link
Contributor Author

ricflams commented Sep 28, 2025

Hey and thanks for the quick reply

if you could share files that are failing (the 1%), that'd be very helpful.

I included 4 pdfs which should get you (Eliot) going, I think. But let me know if you need more.

I ran through about 10K pdfs and 84 failed. This incomplete xref was the reason for 72 of them and the remaining 12 looked legit broken: eg xref-pointers into nowhere etc. And PAC (I didn’t try Acrobat) also complained those 12 were broken. So it really did seem like it was just fixing “all the problems”.

When cheking more files, like a sample of 100K pdfs, I’ve generally seen PdfPig doing really well at parsing - as good as PAC and Acrobat, it feels like.

That being said, if you i have time to look into your fix in the context of the new code, that'd be amazing. Thanks again!

Maybe I do in a week. I took a quick look after I realized the old xref-table-code wasn’t in play anymore and it just looked so completely different with the bruteforce-approach that it would take me a some time to see where it went wrong with the stream-parts. And I figured you guys might know just what to do, once you were made aware of the issue and had some sample files to look at. And, it was late 😉

@ricflams
Copy link
Contributor Author

@BobLd fyi I took a look this morning and spotted how to fix it in the refactored xref-reading-code. I’ll update this PR with that fix instead later today.

- If an XrefTable has an associated stream, as indicated via the XrefStm-property, then read and add that XrefStream
- Any table can have 0 or 1 such associated streams
- A caveat: such an associated stream might also theoretically be part of the Parts-sequence in which case it would be encountered both by looping through all those parts along with all the regular tables and now also by association to any of those tables. It doesn't seem harmful since the offsets are flattened eventually anyway and stored by their offset-key into a mapping-table.
@ricflams ricflams changed the title Add xref-streams tied to any parts, not just the first Bugfix: xref-streams were not added Sep 29, 2025
With the fix for including associated streams, this test now finds more text on the first page. I've verified using Aspose.PDF and by viewing the ErcotFacts.pdf file being tested that yes, it was indeed missing part of the text before.
Page two has had four more characters added, which is now delected by this xref-stream fix
Including the stream-xref means that the formerly missing font is no longer missing, so simply run the two test-cases under the (stricter) assumption of SkipMissingFonts=false.
@BobLd
Copy link
Collaborator

BobLd commented Sep 30, 2025

Thanks a lot @ricflams ! Great stuff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants