Add CoreTokenScanner.ClearPreReadByte() and further fix #1332 #1334

Merged
BobLd merged 1 commit into UglyToad:master from BobLd:issues/1332-2
Jun 19, 2026

BobLd (Collaborator) commented Jun 19, 2026

Commit 9c9cb41 added % to the PlainTokenizer break set, so a comment that immediately follows a keyword or operator now terminates the token. This matches ISO 32000-2 §7.2.3–7.2.4 and PDFBox's BaseParser.isEndOfName, which likewise treats % as a delimiter.
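To illustrate the effect of the break set, here is a minimal Python sketch rather than PdfPig's actual C# code. The function read_plain_token and the constant names are hypothetical; the whitespace and delimiter bytes are the ones listed in ISO 32000 §7.2.3, and the real PlainTokenizer break set may differ.

```python
# PDF whitespace and delimiter bytes as listed in ISO 32000 section 7.2.3.
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS_WITHOUT_PERCENT = b"()<>[]{}/"
DELIMITERS = DELIMITERS_WITHOUT_PERCENT + b"%"


def read_plain_token(data: bytes, start: int, break_set: bytes) -> bytes:
    """Hypothetical helper: read from start until whitespace or a break byte."""
    end = start
    while (
        end < len(data)
        and data[end] not in WHITESPACE
        and data[end] not in break_set
    ):
        end += 1
    return data[start:end]


sample = b"endobj%trailing comment\n"

# Without % in the break set the comment marker is swallowed into the keyword.
old_token = read_plain_token(sample, 0, DELIMITERS_WITHOUT_PERCENT)

# With % in the break set the keyword ends where the comment begins.
new_token = read_plain_token(sample, 0, DELIMITERS)

print(old_token, new_token)
```

With % treated as a delimiter, the keyword is recognised even when no whitespace separates it from the comment.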

That change exposed a latent bug. PdfPig's CoreTokenScanner keeps a persistent look-ahead byte (hasBytePreRead) that was not discarded on Seek. After jumping to an object's xref offset, the scanner therefore began reading one byte late and failed. That failure triggered a spurious brute-force scan while encryption was still in its NoOp phase, and the scan cached objects undecrypted.
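The general hazard can be reproduced with a toy scanner in Python. This is a sketch under stated assumptions: StaleLookAheadScanner and its members are hypothetical, and it does not attempt to reproduce PdfPig's internals or the exact off-by-one described above, only what can go wrong when a look-ahead byte survives a seek.

```python
import io


class StaleLookAheadScanner:
    """Toy scanner that stashes the delimiter which ended the previous token."""

    def __init__(self, data: bytes) -> None:
        self._stream = io.BytesIO(data)
        self._pre_read = None  # plays the role of a persistent look-ahead byte

    def seek(self, offset: int) -> None:
        # Deliberately leaves self._pre_read untouched to show the hazard.
        self._stream.seek(offset)

    def read_token(self) -> bytes:
        token = bytearray()
        while True:
            if self._pre_read is not None:
                byte, self._pre_read = self._pre_read, None
            else:
                byte = self._stream.read(1)
            if byte == b"":
                break  # end of input
            if byte in b" \r\n":
                if token:
                    break  # whitespace ends the token and is consumed
                continue  # skip leading whitespace
            if byte == b"%" and token:
                self._pre_read = byte  # delimiter is kept for the next read
                break
            token += byte
        return bytes(token)


data = b"endobj%note\n12 0 obj\n"
object_offset = data.index(b"12 0 obj")

scanner = StaleLookAheadScanner(data)
first = scanner.read_token()        # stops at % and stashes it
scanner.seek(object_offset)         # jump to the object, stale byte still held
after_seek = scanner.read_token()   # stale % is prepended to the object number

print(first, after_seek)
```

The token read after the seek no longer matches the bytes at the sought offset, so a parser expecting an object number there would reject it.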

The fix adds CoreTokenScanner.ClearPreReadByte() and calls it after the seeks in PdfTokenScanner.Get and TryBruteForceFileToFindReference, so the next read starts exactly at the sought byte. This mirrors PDFBox, whose parseFileObject calls source.seek(objOffset) and immediately reads from that offset using read-then-rewind(1) look-ahead, meaning no byte is ever carried across a seek.
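Both approaches can be sketched side by side in Python. This is an illustrative model, not the PdfPig or PDFBox code: ClearingScanner, clear_pre_read_byte and read_token_with_rewind are hypothetical names, and the tokenizing rules are simplified to whitespace plus %.

```python
import io

WHITESPACE = b" \r\n"


class ClearingScanner:
    """Toy scanner with a persistent look-ahead byte that callers can discard."""

    def __init__(self, data: bytes) -> None:
        self._stream = io.BytesIO(data)
        self._pre_read = None

    def seek(self, offset: int) -> None:
        self._stream.seek(offset)

    def clear_pre_read_byte(self) -> None:
        # Hypothetical analogue of the new method: forget the look-ahead byte.
        self._pre_read = None

    def read_token(self) -> bytes:
        token = bytearray()
        while True:
            if self._pre_read is not None:
                byte, self._pre_read = self._pre_read, None
            else:
                byte = self._stream.read(1)
            if byte == b"" or (byte in WHITESPACE and token):
                break  # end of input, or whitespace ending the token
            if byte in WHITESPACE:
                continue  # skip leading whitespace
            if byte == b"%" and token:
                self._pre_read = byte  # stash the delimiter for the next read
                break
            token += byte
        return bytes(token)


def read_token_with_rewind(stream: io.BytesIO) -> bytes:
    """Read-then-rewind look-ahead: the delimiter is pushed back, not stashed."""
    token = bytearray()
    while True:
        byte = stream.read(1)
        if byte == b"" or (byte in WHITESPACE and token):
            break
        if byte in WHITESPACE:
            continue
        if byte == b"%" and token:
            stream.seek(-1, io.SEEK_CUR)  # rewind one byte, no state is kept
            break
        token += byte
    return bytes(token)


data = b"endobj%note\n12 0 obj\n"
object_offset = data.index(b"12 0 obj")

# Stash-based look-ahead: seek, then explicitly clear the pre-read byte.
scanner = ClearingScanner(data)
first = scanner.read_token()
scanner.seek(object_offset)
scanner.clear_pre_read_byte()
fixed = scanner.read_token()

# Rewind-based look-ahead: the seek alone is enough.
stream = io.BytesIO(data)
rewind_first = read_token_with_rewind(stream)
stream.seek(object_offset)
rewind_after = read_token_with_rewind(stream)

print(fixed, rewind_after)
```

The trade-off in this model is that a stashed byte avoids repositioning the stream on every token but must be invalidated on every seek, whereas rewinding keeps all position state in the stream itself.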
