move file parsing to single-pass static methods #1102

EliotJones · 2025-07-20T23:18:04Z

for the file 0002973.pdf in the test corpus we need to completely overhaul how initial xref parsing is done since we need to locate the xref stream by brute-force and this is currently broken. i wanted to take this opportunity to change the logic to be more imperative and less like the pdfbox methods with instance data and classes.

currently the logic is split between the xref offset validator and parser methods and we call the validator logic twice, followed by brute-force searching again in the actual parser. we're going to move to a single method that performs the following steps:

1. find the last occurrence of "startxref" (searching backwards from the end of the file) and read the byte offset that follows it. this will also support "startref" since some files in the wild have that
2. go to that offset, if found, and parse the chain of tables or streams by following each /Prev reference
3. if any element in step 2 fails, perform a single brute-force pass over the entire file and, like pdfbox, treat xrefs that appear later in the file as the ultimate arbiter of the object positions. while we do this we can potentially capture the actual object offsets, since the xref positions are probably incorrect too.
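
as a rough illustration of the first step, here is a minimal python sketch (not the PdfPig code; `find_start_xref_offset` is a hypothetical name) that looks for the last "startxref" or "startref" keyword and reads the decimal offset after it. it assumes the whole file is already in memory as bytes.

```python
import re

# Hypothetical helper for illustration only, not a PdfPig API.
_OFFSET = re.compile(rb"\s*(\d+)")

def find_start_xref_offset(data: bytes):
    """Return the offset after the last "startxref"/"startref", or None."""
    best, keyword_len = -1, 0
    # Accept the misspelled "startref" too, keeping whichever is nearest the end.
    for keyword in (b"startxref", b"startref"):
        pos = data.rfind(keyword)
        if pos > best:
            best, keyword_len = pos, len(keyword)
    if best < 0:
        return None
    # Only a short window after the keyword is inspected for the number.
    start = best + keyword_len
    match = _OFFSET.match(data[start:start + 32])
    return int(match.group(1)) if match else None
```

a `None` result here would simply mean going straight to the brute-force pass in step 3.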

the aim with this is to avoid as much seeking and re-reading of bytes as possible. while this won't technically be single-pass, it gets us much closer. it also removes the stricter logic requiring a "startxref" token to exist and be valid, since we can repair this by brute-force anyway.
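
to give a feel for the brute-force repair, the python sketch below (an assumption for illustration, not the PdfPig implementation) makes one forward pass over the bytes and records where each "N G obj" header starts. note that it scans object headers directly rather than xref sections, and it lets a later definition overwrite an earlier one, mirroring the "later in the file wins" rule. `brute_force_object_offsets` is a hypothetical name, and a real scanner would also need to avoid false matches inside stream data.

```python
import re

# Matches "<object number> <generation> obj" that is not preceded by a digit.
_OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")

def brute_force_object_offsets(data: bytes) -> dict:
    """Map (object number, generation) to byte offset; later definitions win."""
    offsets = {}
    for match in _OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # Assigning unconditionally means the last occurrence in the file is kept.
        offsets[key] = match.start(1)
    return offsets
```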

we will surface as much information as possible from the static method so that we could in future support an object explorer ui for pdfs.

this will also be more resilient to invalid xref formats with e.g. comment tokens or missing newlines.
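
one way to picture that resilience: tokenize the table instead of relying on line structure. the python sketch below is an assumption about the kind of damage meant here (comments inside the table, entries run together without a newline), not the PdfPig parser. `parse_xref_table` is a hypothetical name.

```python
import re

_COMMENT = re.compile(rb"%[^\r\n]*")
_TOKEN = re.compile(rb"\d+|xref|[nf]")

def parse_xref_table(data: bytes) -> dict:
    """Return {object number: (offset, generation)} for in-use entries."""
    # Drop comments first, then ignore everything from "trailer" onwards.
    body = _COMMENT.sub(b" ", data).split(b"trailer", 1)[0]
    tokens = _TOKEN.findall(body)
    if not tokens or tokens[0] != b"xref":
        raise ValueError("expected 'xref' keyword")
    entries = {}
    i = 1
    # Each subsection starts with "<first object number> <entry count>".
    while i + 1 < len(tokens) and tokens[i].isdigit() and tokens[i + 1].isdigit():
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for number in range(first, first + count):
            if i + 2 >= len(tokens):
                raise ValueError("truncated xref subsection")
            offset, generation, kind = tokens[i], tokens[i + 1], tokens[i + 2]
            if not (offset.isdigit() and generation.isdigit() and kind in (b"n", b"f")):
                raise ValueError("malformed xref entry")
            if kind == b"n":  # free ("f") entries are skipped
                entries[number] = (int(offset), int(generation))
            i += 3
    return entries
```

because tokens are matched by pattern rather than split on line breaks, an entry such as `00000 n0000000081` still separates into `n` and the next offset.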

BobLd · 2025-08-09T10:53:01Z

@EliotJones don't hesitate to say so if you don't agree with my feedback, or to ask if you need help

EliotJones · 2025-08-09T16:09:20Z

@BobLd all good I just haven't had time to get back to this yet. I also need to check how it handles the case where the file contains object streams.

EliotJones · 2025-09-01T11:46:38Z

@BobLd I've handled a couple of feedback items now and added the test I wanted to include. Let me know if you have any further feedback.

EliotJones added 5 commits July 20, 2025 18:17

move more parsing to the static classes

196307a

plumb through the new parsing results

be0086e

plug in new parser and remove old classes, port tests to new classes

fa1a022

update tests to reflect logic changes

c20021d

EliotJones mentioned this pull request Jul 25, 2025

rework numeric tokenizer hot path #1104

Merged

EliotJones added 6 commits August 3, 2025 13:03

apply correction when file header has offset

f471137

merge latest master

78af0e3

ignore console runner launch settings

037b79c

skip offsets outside of file bounds

4409998

fix parsing tables missing a line break

48b3c26

use brute forced locations if they're already present

8ef88bb

EliotJones requested a review from BobLd August 3, 2025 21:29

EliotJones marked this pull request as ready for review August 3, 2025 21:29

EliotJones changed the title ~~WIP: move file parsing to single-pass static methods~~ move file parsing to single-pass static methods Aug 3, 2025

only treat line breaks and spaces as whitespace for stream content

f84caa0

BobLd requested changes Aug 4, 2025

src/UglyToad.PdfPig/Parser/FileStructure/FileHeaderOffset.cs

src/UglyToad.PdfPig/Parser/FileStructure/IXrefSection.cs

src/UglyToad.PdfPig/Parser/FileStructure/XrefStreamParser.cs

address review comments

8def2d0

Merge branch 'master' into redesign-of-initial-file-parsing

ab5517b

BobLd merged commit 0afe021 into master Sep 2, 2025
2 checks passed

BobLd deleted the redesign-of-initial-file-parsing branch September 2, 2025 18:41

EliotJones mentioned this pull request Sep 7, 2025

File buffering read stream investigation #1140

Merged

This was referenced Nov 23, 2025

Bump PdfPig from 0.1.11 to 0.1.12 yildirim-mehmet/onlineOfiice#7

Open

Bump PdfPig from 0.1.11 to 0.1.12 dotnet-presentations/ai-workshop#283

Open

This was referenced Dec 1, 2025

Bump PdfPig from 0.1.11 to 0.1.12 EvotecIT/OfficeIMO#1385

Open

Bump PdfPig from 0.1.11 to 0.1.12 MjrTom/PDF2MD#49

Open

Bump the nuget-all group with 9 updates magico13/MagiCloud#9

Open
