
Public PdfScanner Accessor on Page#1305

Closed
jeske wants to merge 1 commit into UglyToad:master from ArtificialNecessity:master


jeske commented May 24, 2026


TL;DR: I am using PdfPig in a PDF renderer, and in order to render embedded Form XObjects, I had to expose PdfScanner.


PdfPig Fork: Public PdfScanner Accessor on Page

Summary

We added a public IPdfTokenScanner PdfScanner property to UglyToad.PdfPig.Content.Page (commit b1a0abd9). This exposes the page's internal token scanner — the component responsible for resolving indirect object references within a PDF — to consumers of the Page object.

Why This Was Needed

The Problem: Rendering Embedded Form XObjects

PDF pages can contain Form XObjects — reusable content streams referenced via the Do operator (e.g., repeated logos, vector artwork, or template overlays). When our renderer (PdfPageView) encounters an InvokeNamedXObject operation, it needs to:

  1. Navigate the page's resource dictionary: Page Dictionary → /Resources → /XObject → /<name>
  2. Resolve the XObject stream (which may be stored as an indirect reference)
  3. Read the Form XObject's /Matrix entry (also potentially an indirect reference)
  4. Decode the content stream and recursively render the operations

Each of these steps requires resolving PDF indirect references (e.g., 12 0 R) back to their actual token values. That's what IPdfTokenScanner does — it's the lookup table from object numbers to their resolved content.
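
To make the scanner's role concrete, here is a toy model of indirect-reference resolution. It is a sketch only and not PdfPig's implementation: `Ref`, `ToyScanner`, and `TryResolve` are hypothetical names invented for illustration, and real PDF objects are tokens rather than plain .NET objects.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an indirect reference such as "12 0 R".
public sealed record Ref(int ObjectNumber, int Generation);

// Toy model only: maps references to values and follows reference chains.
public sealed class ToyScanner
{
    private readonly Dictionary<Ref, object> objects = new Dictionary<Ref, object>();

    public void Add(Ref reference, object value) => objects[reference] = value;

    // Follows references until a direct (non-Ref) value is reached.
    // Returns false on a missing object, a type mismatch, or a reference cycle.
    public bool TryResolve<T>(object token, out T result)
    {
        var seen = new HashSet<Ref>();
        object current = token;

        while (current is Ref r)
        {
            if (!seen.Add(r) || !objects.TryGetValue(r, out var next))
            {
                result = default;
                return false;
            }

            current = next;
        }

        if (current is T typed)
        {
            result = typed;
            return true;
        }

        result = default;
        return false;
    }
}
```

In this model, a dictionary entry that holds `new Ref(12, 0)` and one that holds the value directly are handled by the same call, which is the property the renderer relies on when walking /Resources → /XObject.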

The Upstream Gap

PdfPig's Page class held the scanner as a private field (pdfScanner) and used it internally for its own operations (annotations, experimental access, etc.), but never exposed it to consumers. There was no public API to resolve arbitrary indirect references from a page's dictionary tree.

Without scanner access, our renderer had no way to:

  • Walk the /Resources → /XObject dictionary chain (entries are often indirect references)
  • Resolve the StreamToken for a Form XObject
  • Read the /Matrix array from a Form XObject's stream dictionary

Where We Use It

1. SafePdfDocumentModel.ResolveFormXObject() (SafePdfDocumentModel.cs:94)

var scanner = page.PdfScanner;

// Navigate: Page Dictionary → /Resources → /XObject → /<name>
PdfExtensions.TryGet<DictionaryToken>(page.Dictionary, NameToken.Resources, scanner, out var resources);
PdfExtensions.TryGet<DictionaryToken>(resources, NameToken.Xobject, scanner, out var xobjectDict);
PdfExtensions.TryGet<StreamToken>(xobjectDict, xobjectName, scanner, out var xobjectStream);

This resolves the full chain of indirect references from the page dictionary down to the actual Form XObject stream, then decodes and parses it into renderable operations. Results are cached per (page, xobjectName) for reuse across frames.

2. PdfPageView.ResolveFormXObject() (PdfPageView.cs:936)

Same pattern as above — a fallback path used when no document model is available.

3. Form XObject /Matrix Resolution (PdfPageView.cs:683)

PdfExtensions.TryGet<ArrayToken>(formStream.StreamDictionary, NameToken.Matrix, _page!.PdfScanner, out var matrixToken)

After resolving the Form XObject stream, we need to check if it has a /Matrix entry (a 6-element affine transform that positions the XObject content). This entry could itself be an indirect reference, so the scanner is needed here too.
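
For reference, a 6-element PDF matrix [a b c d e f] maps a point (x, y) to (a·x + c·y + e, b·x + d·y + f), and a Form XObject without a /Matrix entry uses the identity matrix. The following is a minimal sketch of that arithmetic; `FormMatrix` and its members are hypothetical helpers for illustration, not part of PdfPig or of our renderer.

```csharp
using System;

// Hypothetical helper, not part of PdfPig: applies a PDF-style matrix [a b c d e f].
public static class FormMatrix
{
    // Identity matrix, the default when a Form XObject has no /Matrix entry.
    public static readonly double[] Identity = { 1, 0, 0, 1, 0, 0 };

    // x' = a*x + c*y + e,  y' = b*x + d*y + f
    public static (double X, double Y) Apply(double[] m, double x, double y)
    {
        if (m is null || m.Length != 6)
        {
            throw new ArgumentException("Expected a 6-element matrix.", nameof(m));
        }

        return (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]);
    }
}
```

For example, `FormMatrix.Apply(new double[] { 2, 0, 0, 2, 10, 20 }, 1, 1)` scales the point by 2 and translates it by (10, 20), giving (12, 22).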

The Change (in the fork)

File: src/UglyToad.PdfPig/Content/Page.cs

/// <summary>
/// The PDF token scanner for resolving indirect references in this page's dictionary.
/// </summary>
public IPdfTokenScanner PdfScanner => pdfScanner;

This is a minimal, read-only property exposing the existing private field. No behavioral changes, no new allocations, no breaking changes to existing consumers.

Alternatives Considered

  • Reflection: Could access the private field via reflection, but fragile and slow in a per-frame render loop.
  • Re-opening the document with a custom parser: Would duplicate state and lose page-level caching.
  • Using PdfPig's built-in Page.GetImages() / content stream API: PdfPig's internal rendering pipeline processes Form XObjects, but doesn't expose the parsed operations in a way our custom NanoVG-based renderer can consume. We need the raw IGraphicsStateOperation list to drive our own graphics state machine.
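
To illustrate why the reflection route is fragile, here is a minimal sketch using a stand-in class. `FakePage` and `ReflectionAccess` are hypothetical and exist only for this example; the point is that the lookup is keyed on a private field name, so it silently returns null if that field is ever renamed or removed.

```csharp
using System;
using System.Reflection;

// Stand-in for a library type with a private field; not PdfPig's actual Page class.
public sealed class FakePage
{
    private readonly string pdfScanner = "scanner";
}

public static class ReflectionAccess
{
    // Reads a private instance field by name.
    // Returns null if no field with that name exists (e.g. after a rename).
    public static object ReadPrivateField(object target, string fieldName)
    {
        FieldInfo field = target.GetType().GetField(
            fieldName,
            BindingFlags.Instance | BindingFlags.NonPublic);

        return field?.GetValue(target);
    }
}
```

A public property, by contrast, is checked at compile time, so a rename breaks the build instead of failing at render time.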

Related Fork Changes

  • 7e0ac6da — Made ProcessOperations virtual in BaseStreamProcessor (upstream PR by BobLd), enabling custom stream processors to override Form XObject handling.
  • 217b776b — Fixed decode values in images (related to correct XObject rendering).

@@ -1,4 +1,4 @@
<Project Sdk="Microsoft.NET.Sdk">
<Project Sdk="Microsoft.NET.Sdk">


Can you undo the changes in this file?


BobLd commented May 25, 2026


@jeske I have added a comment. Side question: what PDF renderer are you using?


BobLd commented May 26, 2026


@jeske given the token scanner is shared across pages, making it available at page level might be a bit misleading. What about at PdfDocument level? Possibly just making it public in the Structure object

public class Structure


jeske commented May 27, 2026


> @jeske given the token scanner is shared across pages, making it available at page level might be a bit misleading. What about at PdfDocument level? Possibly just making it public in the Structure object
>
> public class Structure

That's sensible. I'll submit another patch. Thanks!

The PDF renderer is my own. The intent is a 99% managed PDF viewer, to (mostly) eliminate common buffer-overrun security vulnerabilities. And it's also obscenely fast.

It uses PdfPig for parsing, wrapped in my own 2D toolkit called Fluid, plus a fork of SilkyNvg that I added a Veldrid backend to (and soon CFF).


jeske commented May 29, 2026


Closing, as this is superseded by #1308.

jeske closed this May 29, 2026