Bugfix: xref-streams were not added #1173

ricflams · 2025-09-26T19:52:29Z

On 8000 pdf-files PdfPig failed to read the correct StructTree-object for about 1% of them. The StructTree object was simply missing in the CrossReferenceTable.

It turned out that the constructed CrossReferenceTable could miss Stream-parts if there were multiple Table-parts because a stream will only be added if it's associated with the very first Table-part. The remedy would seem to be to check for and add streams that are associated with any of the Table-parts, not just the first one.
☝️ That was for the old code in CrossReferenceTableBuilder. I'm still including that just FYI. The fix for the refactored xref code is in FirstPassParser.cs instead: associated streams were just not being added.

On a sample of 72 files where this failed, this change fixed the StructTree for all of them.
I've included four files where this was fixed:

doc-00044a646349ce519d9b7287159795cc.pdf
doc-00072bb64c86990e7391ef5baca69099.pdf
doc-000558f2b1d03d09ea6bb359732a8f09.pdf
doc-000518224506a58efb993b67ef57661b.pdf

Obsolete comment:
I fetched latest master (it had been a couple of months) and realized that the code wasn't called anymore because the parsing of xrefs had been completely refactored: 0afe021. The refactored PdfPig-code still can't deal with that situation (for instance the 4 files I've attached) so I'm back to square one. Even though my change in this PR is for a file than are no longer in play I'm submitting it anyway, since it may give you a hint about what's currently wrong and how to fix it. I really appreciate the effort you're putting into PdfPig. Just rather bad timing that the fix I made today for a problem that had been troubling me for months, was in vain because that part has been refactored three weeks ago 🥲

The fix also revealed that one of the github test-case files, ErcotFacts.pdf in Issue874, was not being read correctly before. It actually included more text that was only referred to in the missing stream-xref: here's a diff of the text with/without this PR's fix, how the pdf is rendered in Chrome, and what text Aspose.PDF extracts:

The fix also means that ErcotFacts.pdf is no longer missing a font so that test-case has been changed too.

On a large sample of pdf-files PdfPig failed to read the correct StructTree-object for about 1% of them. The StructTree object was simply missing in the CrossReferenceTable.CrossReferenceTable. It turned out that the constructed CrossReferenceTable could miss Stream-parts if there were multiple Table-parts because a stream will only be added if it's associated with the very first Table-part. The remedy would seem to be to check for and add streams that are associated with any of the Table-parts, not just the first one. On a sample of 72 files where this failed, this changed fixed the StructTree for all of them.

BobLd · 2025-09-28T20:31:57Z

@ricflams thanks a lot for sharing the insight, even if outdated now. @EliotJones worked on the refactoring of this part of the code, so ill let him have a look first.

If he does not have time, ill try to find the time to take care of the change. In this context, if you could share files that are failing (the 1%), that'd be very helpful.

That being said, if you i have time to look into your fix in the context of the new code, that'd be amazing. Thanks again!

ricflams · 2025-09-28T22:58:57Z

Hey and thanks for the quick reply

if you could share files that are failing (the 1%), that'd be very helpful.

I included 4 pdfs which should get you (Eliot) going, I think. But let me know if you need more.

I ran through about 10K pdfs and 84 failed. This incomplete xref was the reason for 72 of them and the remaining 12 looked legit broken: eg xref-pointers into nowhere etc. And PAC (I didn’t try Acrobat) also complained those 12 were broken. So it really did seem like it was just fixing “all the problems”.

When cheking more files, like a sample of 100K pdfs, I’ve generally seen PdfPig doing really well at parsing - as good as PAC and Acrobat, it feels like.

That being said, if you i have time to look into your fix in the context of the new code, that'd be amazing. Thanks again!

Maybe I do in a week. I took a quick look after I realized the old xref-table-code wasn’t in play anymore and it just looked so completely different with the bruteforce-approach that it would take me a some time to see where it went wrong with the stream-parts. And I figured you guys might know just what to do, once you were made aware of the issue and had some sample files to look at. And, it was late 😉

ricflams · 2025-09-29T11:29:12Z

@BobLd fyi I took a look this morning and spotted how to fix it in the refactored xref-reading-code. I’ll update this PR with that fix instead later today.

- If an XrefTable has an associated stream, as indicated via the XrefStm-property, then read and add that XrefStream - Any table can have 0 or 1 such associated streams - A caveat: such an associated stream might also theoretically be part of the Parts-sequence in which case it would be encountered both by looping through all those parts along with all the regular tables and now also by association to any of those tables. It doesn't seem harmful since the offsets are flattened eventually anyway and stored by their offset-key into a mapping-table.

With the fix for including associated streams, this test now finds more text on the first page. I've verified using Aspose.PDF and by viewing the ErcotFacts.pdf file being tested that yes, it was indeed missing part of the text before.

Page two has had four more characters added, which is now delected by this xref-stream fix

Including the stream-xref means that the formerly missing font is no longer missing, so simply run the two test-cases under the (stricter) assumption of SkipMissingFonts=false.

BobLd · 2025-09-30T17:35:38Z

Thanks a lot @ricflams ! Great stuff

ricflams added 2 commits September 29, 2025 17:46

Merge branch 'master' into master

15dd04a

ricflams changed the title ~~Add xref-streams tied to any parts, not just the first~~ Bugfix: xref-streams were not added Sep 29, 2025

ricflams added 3 commits September 30, 2025 15:54

Update test Issue874: Also more text on page 2

78d80fa

Page two has had four more characters added, which is now delected by this xref-stream fix

Update test Issue874: No longer missing a font

4a4ede7

Including the stream-xref means that the formerly missing font is no longer missing, so simply run the two test-cases under the (stricter) assumption of SkipMissingFonts=false.

BobLd merged commit d7d01f8 into UglyToad:master Sep 30, 2025
2 checks passed

This was referenced Nov 23, 2025

Bump PdfPig from 0.1.11 to 0.1.12 yildirim-mehmet/onlineOfiice#7

Open

Bump PdfPig from 0.1.11 to 0.1.12 dotnet-presentations/ai-workshop#283

Open

This was referenced Dec 1, 2025

Bump PdfPig from 0.1.11 to 0.1.12 EvotecIT/OfficeIMO#1385

Open

Bump PdfPig from 0.1.11 to 0.1.12 MjrTom/PDF2MD#49

Open

Bump the nuget-all group with 9 updates magico13/MagiCloud#9

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bugfix: xref-streams were not added #1173

Bugfix: xref-streams were not added #1173

Uh oh!

ricflams commented Sep 26, 2025 •

edited

Loading

Uh oh!

BobLd commented Sep 28, 2025

Uh oh!

ricflams commented Sep 28, 2025 •

edited

Loading

Uh oh!

ricflams commented Sep 29, 2025

Uh oh!

BobLd commented Sep 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Bugfix: xref-streams were not added #1173

Bugfix: xref-streams were not added #1173

Uh oh!

Conversation

ricflams commented Sep 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BobLd commented Sep 28, 2025

Uh oh!

ricflams commented Sep 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ricflams commented Sep 29, 2025

Uh oh!

BobLd commented Sep 30, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ricflams commented Sep 26, 2025 •

edited

Loading

ricflams commented Sep 28, 2025 •

edited

Loading