Public PdfScanner Accessor on Page #1305
| @@ -1,4 +1,4 @@ | |||
| <Project Sdk="Microsoft.NET.Sdk"> | |||
| <Project Sdk="Microsoft.NET.Sdk"> | |||
can you undo changes in this document?
@jeske I have added a comment. Side question: what PDF renderer are you using?

@jeske Given that the token scanner is shared across pages, making it available at page level might be a bit misleading. What about at PdfDocument level? Possibly just making it public in the Structure object (PdfPig/src/UglyToad.PdfPig/Structure.cs, line 14 in 450b855)?
That's sensible. I'll submit another patch. Thanks! The PDF renderer is my own. The intent is a 99% managed PDF viewer to (mostly) eliminate common buffer overrun security vulnerabilities. And it's also obscenely fast. PdfPig parsing, wrapped in my own 2D toolkit called Fluid, and a fork of SilkyNvg that I added a Veldrid backend to (and soon CFF).

Closing as this is superseded by #1308.
TL;DR: I am using PdfPig in a PDF renderer, and in order to render embedded Form XObjects, I had to expose PdfScanner.
PdfPig Fork: Public `PdfScanner` Accessor on `Page`

Summary

We added a public `IPdfTokenScanner PdfScanner` property to `UglyToad.PdfPig.Content.Page` (commit `b1a0abd9`). This exposes the page's internal token scanner, the component responsible for resolving indirect object references within a PDF, to consumers of the `Page` object.

Why This Was Needed
The Problem: Rendering Embedded Form XObjects
PDF pages can contain Form XObjects: reusable content streams referenced via the `Do` operator (e.g., repeated logos, vector artwork, or template overlays). When our renderer (`PdfPageView`) encounters an `InvokeNamedXObject` operation, it needs to:

- Follow the `Page Dictionary → /Resources → /XObject → /<name>` chain to find the Form XObject
- Read the `/Matrix` entry (also potentially an indirect reference)

Each of these steps requires resolving PDF indirect references (e.g., `12 0 R`) back to their actual token values. That is what `IPdfTokenScanner` does: it is the lookup table from object numbers to their resolved content.

The Upstream Gap
PdfPig's `Page` class held the scanner as a `private` field (`pdfScanner`) and used it internally for its own operations (annotations, experimental access, etc.), but never exposed it to consumers. There was no public API to resolve arbitrary indirect references from a page's dictionary tree.

Without scanner access, our renderer had no way to:

- Walk the `/Resources → /XObject` dictionary chain (entries are often indirect references)
- Obtain the `StreamToken` for a Form XObject
- Read the `/Matrix` array from a Form XObject's stream dictionary

Where We Use It
1. `SafePdfDocumentModel.ResolveFormXObject()` (`SafePdfDocumentModel.cs:94`)

This resolves the full chain of indirect references from the page dictionary down to the actual Form XObject stream, then decodes and parses it into renderable operations. Results are cached per `(page, xobjectName)` for reuse across frames.

2. `PdfPageView.ResolveFormXObject()` (`PdfPageView.cs:936`)

Same pattern as above: a fallback path used when no document model is available.
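The lookup pattern shared by the two `ResolveFormXObject()` call sites above can be sketched roughly as follows. This is not the fork's code. `Token`, `Ref`, `Dict`, `StreamObj`, `Scanner`, and `FormXObjectResolver` are hypothetical stand-ins invented for illustration, and PdfPig's real token and scanner types have different shapes. The sketch assumes C# 9 or later and only shows the two ideas described in the text: every hop in the chain may be an indirect reference, and results are cached per `(page, xobjectName)`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for PDF tokens; PdfPig's real token types differ.
public abstract record Token;
public sealed record Ref(int ObjectNumber) : Token;              // e.g. "12 0 R"
public sealed record Dict(Dictionary<string, Token> Entries) : Token;
public sealed record StreamObj(Dict Header, byte[] Data) : Token;

// Stand-in for the token scanner: maps object numbers to their tokens.
public sealed class Scanner
{
    private readonly Dictionary<int, Token> objects;
    public Scanner(Dictionary<int, Token> objects) => this.objects = objects;
    public Token Get(int objectNumber) => objects[objectNumber];
}

public sealed class FormXObjectResolver
{
    private readonly Scanner scanner;
    // Cache keyed by (page number, XObject name), as described in the text.
    private readonly Dictionary<(int Page, string Name), StreamObj?> cache = new();

    public FormXObjectResolver(Scanner scanner) => this.scanner = scanner;

    // Follow indirect references until a direct token is reached.
    // (No cycle detection: a malformed file with circular references would loop.)
    private Token Resolve(Token token)
    {
        while (token is Ref r) token = scanner.Get(r.ObjectNumber);
        return token;
    }

    // Resolve the container, then resolve the entry stored under `key`.
    private Token? Lookup(Token? container, string key) =>
        container is not null
        && Resolve(container) is Dict d
        && d.Entries.TryGetValue(key, out var value)
            ? Resolve(value)
            : null;

    // Page Dictionary -> /Resources -> /XObject -> /<name>
    public StreamObj? ResolveFormXObject(int pageNumber, Dict pageDictionary, string name)
    {
        if (cache.TryGetValue((pageNumber, name), out var cached)) return cached;

        var resources = Lookup(pageDictionary, "Resources");
        var xobjects = Lookup(resources, "XObject");
        var result = Lookup(xobjects, name) as StreamObj;

        cache[(pageNumber, name)] = result;   // misses are cached as null too
        return result;
    }
}
```

Caching misses as `null` is a design choice of this sketch, so that a missing XObject is not looked up again on every frame. Whether the fork does the same is not stated in the description.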
3. Form XObject `/Matrix` Resolution (`PdfPageView.cs:683`)

After resolving the Form XObject stream, we need to check whether it has a `/Matrix` entry (a 6-element affine transform that positions the XObject content). This entry could itself be an indirect reference, so the scanner is needed here too.

The Change (in the fork)
File: `src/UglyToad.PdfPig/Content/Page.cs`

This is a minimal, read-only property exposing the existing private field. No behavioral changes, no new allocations, no breaking changes to existing consumers.
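As a rough illustration of the shape of such a change (not the verbatim diff from the commit), the accessor might look like the sketch below. The `IPdfTokenScanner` interface and the `Page` constructor shown here are simplified stand-ins so the snippet compiles on its own. Only the property name `PdfScanner`, its type name, and the private field name `pdfScanner` come from the description above.

```csharp
// Simplified stand-in for PdfPig's IPdfTokenScanner; the real interface differs.
public interface IPdfTokenScanner
{
    object Get(int objectNumber, int generation);
}

public class Page
{
    // The existing private field (name taken from the description above).
    private readonly IPdfTokenScanner pdfScanner;

    // Hypothetical constructor, present only to make the sketch self-contained.
    public Page(IPdfTokenScanner pdfScanner) => this.pdfScanner = pdfScanner;

    // The added accessor: read-only, returns the existing instance,
    // and allocates nothing.
    public IPdfTokenScanner PdfScanner => pdfScanner;
}
```

An expression-bodied getter with no setter keeps the field itself private, so callers can use the scanner but cannot replace it.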
Alternatives Considered
- `Page.GetImages()` / content stream API: PdfPig's internal rendering pipeline processes Form XObjects, but doesn't expose the parsed operations in a way our custom NanoVG-based renderer can consume. We need the raw `IGraphicsStateOperation` list to drive our own graphics state machine.

Related Fork Changes

- `7e0ac6da`: Made `ProcessOperations` virtual in `BaseStreamProcessor` (upstream PR by BobLd), enabling custom stream processors to override Form XObject handling.
- `217b776b`: Fixed decode values in images (related to correct XObject rendering).