Conversation

@rhuijben
Contributor

This builds upon PR #1188 (=first commit of this PR).

Cleans up all built-in tokenizers to no longer read one byte too many, using Peek() on the input bytes instead.

This allows simplifying the logic in quite a few places where other parsers had to compensate for a byte being read early, or where parsers/tokenizers had to seek back to make things work.

Also makes ReadsNextByte a virtual member returning a constant, instead of a hidden readonly field in each parser.

@BobLd
Collaborator

BobLd commented Oct 18, 2025

@rhuijben thanks a lot for both PRs. I'll review them shortly.

@rhuijben
Contributor Author

These are some basic cleanups in preparation for further parser improvements. I'm wondering how stable you want to keep the tokenizer API.

We can remove that initial byte and the 'reads next byte' flag we pass everywhere to simplify things... The peek byte can handle that case.

If we extend the peek to an optionally larger buffer (which can be done without changing the tokenizer API), many other optimizations become possible, including using optimized searches instead of per-byte walks.

That could cut at least ten to twenty percent of the parsing time in the test suite.

I have some code locally that optimizes the number tokenizer this way (the highest CPU user in many tests).
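
To illustrate the peek-based approach (the type and member names below are hypothetical simplifications, not PdfPig's actual API), a number tokenizer can stop at the first non-digit without ever consuming it, so the caller never has to seek back:

```csharp
// Hypothetical minimal input abstraction; PdfPig's real IInputBytes differs.
public interface IPeekableBytes
{
    int Peek();              // next byte without consuming it, or -1 at end
    bool MoveNext();         // consume one byte
    byte CurrentByte { get; }
}

public static class PeekBasedNumberTokenizer
{
    // Reads an unsigned integer, leaving the terminating byte unconsumed.
    // Because the tokenizer never reads past the token, the caller can
    // trust the current position without any "reads next byte" flag.
    public static long ReadInteger(IPeekableBytes bytes)
    {
        long value = 0;
        int c;
        while ((c = bytes.Peek()) >= '0' && c <= '9')
        {
            bytes.MoveNext();
            value = (value * 10) + (bytes.CurrentByte - '0');
        }
        return value;
    }
}
```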

@BobLd
Collaborator

BobLd commented Oct 18, 2025

@rhuijben I'm always happy to add performance improvements (I think I'll add some benchmarking soon).

@EliotJones we might want to release 0.12 soon? (Before refactoring the tokenizer api)

@rhuijben
Contributor Author

If you do performance tests, be careful about which source you use (read: test both). Most tests load the whole file into RAM before processing, while you want something closer to stream processing when doing server batches or really huge files.

Currently this is still necessary for good read throughput, but that is because the stream bytes don't do efficient, proper buffering, and we seek back far too often.

I think I should be able to improve this quite a bit without touching any of the higher layers. I'm just not sure how many third-party tokenizers use the current extension API... and want that to remain stable.

We can also wrap the old API with a newer one if necessary, to keep the public API stable... I'm just not sure about the current guarantees.


bytes.Seek(fetchFrom);

Span<byte> byteBuffer = new byte[bytes.Length - fetchFrom]; // TODO: Maybe use PoolArray?
Collaborator


byteBuffer's max size is 1024 bytes, right? I believe this would be a good use of stackalloc, unless that's not possible.

Contributor Author

@rhuijben rhuijben Oct 20, 2025


Could be. I'm not sure about proper limits for stackalloc. (Usually I would only consider it for a few hundred bytes at most, but PDF processing is not a generic library design that has to run on very limited machines.)

With the next patch set this would live in the look-ahead buffer of the input bytes and there would be no copying at all.


@BobLd
Collaborator

BobLd commented Oct 20, 2025

@rhuijben I've submitted a first comment, have a look when you can.

Also, do you mind rebasing this branch onto master, as I've merged your previous PR? (It will be easier to review.)

…by using seek.

Optimize the FirstPassParser to just fetch a final chunk before doing things char-by-char backwards.
@rhuijben
Contributor Author

[Just rebased. No changes yet]

@BobLd
Collaborator

BobLd commented Oct 21, 2025

@rhuijben thanks a lot for that.

I've started the review, but can you explain why many public bool ReadsNextByte values went from true to false? What's the reasoning behind that?

@rhuijben
Contributor Author

[I have worked on quite a few parsers over the years.]

The current parser model requires each parser, and more importantly each parser user, to handle the first and last byte of what is parsed differently from all the others.

E.g. before using the position/offset you have to know what the last parser did.

The input already hands you the next byte via the peek function, so there is zero reason to make things hard on the caller; we can just fix the parsers to do the right thing.

So before looking at improving the parsers individually, and extending the input bytes to provide a bit more info, I fixed this assumption in this PR so that at least all parsers behave the same.

If the parser API can be changed, I would even recommend removing this flag completely. It makes it far too easy to introduce subtle offset and off-by-one bugs in corner cases.
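
A sketch of the difference (with simplified, hypothetical names, not the actual tokenizer signatures): with the flag, every caller has to branch on the previous tokenizer's behavior before trusting the offset, which is exactly where the off-by-one bugs creep in; with peek-based tokenizers the offset needs no interpretation:

```csharp
// Before: caller-side compensation for tokenizers that consume one byte
// past their token. Forgetting this branch is an off-by-one bug.
static long GetTokenEnd(bool readsNextByte, long currentOffset)
{
    return readsNextByte ? currentOffset - 1 : currentOffset;
}

// After: every tokenizer stops on the last byte of its token and uses
// Peek() for lookahead, so the current offset is always the token end.
static long GetTokenEndPeekBased(long currentOffset) => currentOffset;
```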

This is exactly what this PR is all about: cleaning up some technical debt and preparing for a bigger refactor.

The parser changes in this PR can be reviewed one by one. The other changes are cases where there were assumptions about a parser leaving the parser state in a certain way, and those assumptions are now different.

@BobLd
Collaborator

BobLd commented Oct 21, 2025

@rhuijben thanks a lot for the clarity here.

I think it's fine to change the parser API; I'd just like to first release an official version (this does not mean you need to wait to open PRs). @EliotJones, if there are no objections, I'll do that early next month.

Regarding stackalloc, you can have a look at https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/stackalloc

I don't believe there is an official limit, but this is what Microsoft does:

int length = 1000;
Span<byte> buffer = length <= 1024 ? stackalloc byte[length] : new byte[length];

If you are sure the array will be 1024 bytes at most, then stackalloc should be safe.

Let me know if you want to push this change; I'm happy to merge the PR as it is.

@BobLd
Collaborator

BobLd commented Oct 27, 2025

@rhuijben I finally managed to run some benchmarks:

Array


Code available https://github.com/BobLd/UglyToad.PdfPig.Benchmarks/tree/array

Memory stream


Code available https://github.com/BobLd/UglyToad.PdfPig.Benchmarks/tree/memory-stream

Given these numbers, I'm unsure if this is worth the added complexity. Do you have an opinion?

@rhuijben
Contributor Author

The real benefit is not in the current PR, but in further improvements later on.
The current version would only help with really dumb stream implementations and/or overaggressive virus scanners.

Moving to the other model allows extending the reader with a bigger peek buffer without overhead, and that part will make the real difference.

Not seeking back will also allow wrapping smarter readers over other readers, so we don't have to read some parts (such as image streams) multiple times. This will make things more performant and use a lot less memory (less copying).
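
One way to picture that non-seeking model (a sketch under assumed names, not the code in the fork): a forward-only reader with a bounded look-ahead window can be layered over any stream, including non-seekable ones, and nested readers never need to rewind the underlying source:

```csharp
using System;
using System.IO;

// Sketch: forward-only reading with a bounded peek window. A layered
// reader (e.g. one scanning an image stream) can inspect upcoming
// bytes via Peek and then Advance, without ever calling Seek.
public sealed class PeekBufferedReader
{
    private readonly Stream inner;                 // may be non-seekable
    private readonly byte[] window = new byte[4096];
    private int start;                             // first valid byte
    private int end;                               // one past last valid byte

    public PeekBufferedReader(Stream inner) => this.inner = inner;

    // Returns up to 'count' upcoming bytes (count <= window size)
    // without consuming them; may be shorter near end of stream.
    public ReadOnlySpan<byte> Peek(int count)
    {
        if (end - start < count)
        {
            // Slide the remaining bytes to the front and refill.
            Buffer.BlockCopy(window, start, window, 0, end - start);
            end -= start;
            start = 0;
            int read;
            while (end < window.Length &&
                   (read = inner.Read(window, end, window.Length - end)) > 0)
            {
                end += read;
            }
        }
        return new ReadOnlySpan<byte>(window, start, Math.Min(count, end - start));
    }

    // Consume 'count' bytes that were previously peeked.
    public void Advance(int count) => start += Math.Min(count, end - start);
}
```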

I have some of this in a branch in my GitHub fork if you want to look at the direction I'm thinking.

(This branch is not PR-ready. It changes most tests to read from a stream instead of an in-memory copy, to let the differences show up earlier during my tests. But it shows some progress.)

@BobLd
Collaborator

BobLd commented Oct 28, 2025

@rhuijben Thanks again for the clarity - much appreciated.

I'll merge as-is and look into the work in your fork.

Thanks again

@BobLd BobLd merged commit e11dc6b into UglyToad:master Oct 28, 2025
2 checks passed
@rhuijben
Contributor Author

@BobLd I will see which parts of that branch are ready for new PRs, and I'll try to use your new tests to see what to attack first.

The VS2026 test versions expose more profiling features without needing the very expensive VS editions, making it easy to find the hot code paths.

@BobLd
Collaborator

BobLd commented Oct 28, 2025

@rhuijben thanks for that. Side note: the full tests that run upon merging now fail; can you have a look? Thanks!

@rhuijben
Contributor Author

I will see where that happens. From the GitHub test output it looks like a Type1 font parsing issue that isn't caught by the .NET test run.

@rhuijben
Contributor Author

I added a Debug.Assert(name != null); after reading the literal, as the result was not checked. The previous code would just have used the null value, so removing that check will unbreak the build.

But it is probably interesting to see where the PDF processing fails.
I'm unable to download the blob with all these test cases locally; not sure if I need something for that?
Can you send me the relevant file or help me with how to fetch it?

@BobLd
Collaborator

BobLd commented Oct 29, 2025

@rhuijben yes, it's more complicated to download the documents than I expected. The website is here: https://digitalcorpora.org/corpora/file-corpora/cc-main-2021-31-pdf-untruncated/

Reading the documentation, the document you are looking for should be in

https://digitalcorpora.s3.amazonaws.com/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/0000.zip

I'll have a second look tonight.

@rhuijben
Contributor Author

rhuijben commented Oct 29, 2025

OK, I can download the file now. Thanks for mangling the URL into something readable... my local attempts produced different URLs, all ending in an Amazon 404.

@rhuijben
Contributor Author

index aac323f3..6d42ca5e 100644
--- a/src/UglyToad.PdfPig.Fonts/Type1/Parser/Type1Tokenizer.cs
+++ b/src/UglyToad.PdfPig.Fonts/Type1/Parser/Type1Tokenizer.cs
@@ -72,8 +72,10 @@
                         case '/':
                             {
                                 bytes.MoveNext();
-                                TryReadLiteral(out var name);
-                                Debug.Assert(name != null);
+
+                                if (!TryReadLiteral(out var name))
+                                    name = ""; // Should not happen, but does
+
                                 return new Type1Token(name, Type1Token.TokenType.Literal);
                             }
                         case '<':

This restores the old behavior of just creating an empty name when there is no data, and with that fixes the test. (I'm not sure it is really the behavior we want.)

@BobLd
Collaborator

BobLd commented Oct 29, 2025

@rhuijben thanks for that. Do you mind creating a PR with the fix?

EDIT: Just realised that this Assert should not impact the test run... it looks like it ran in debug mode.

EDIT 2: I've pushed a change to run these tests in release mode (#1195), the tests now pass.
