@BobLd
Collaborator

@BobLd BobLd commented Oct 23, 2025

Unfortunately the regression was not caught with the tests in the project.

I caught that when running tests in the https://github.com/BobLd/PdfPig.Rendering.Skia project.

3 tests are now failing over there:

  • UglyToad.PdfPig.Rendering.Skia.Tests.TestRendering.PdfPigSkiaTest(expectedImage: "AcroFormsBasicFields_1.png", pdfFile: "AcroFormsBasicFields.pdf", pageNumber: 1, scale: 2) Failed
  • UglyToad.PdfPig.Rendering.Skia.Tests.TestRendering.PdfPigSkiaTest(expectedImage: "FontMatrix-concat_1.png", pdfFile: "FontMatrix-concat.pdf", pageNumber: 1, scale: 2) Failed
  • UglyToad.PdfPig.Rendering.Skia.Tests.TestRendering.PdfPigSkiaTest(expectedImage: "caly-issues-58-2_1.png", pdfFile: "caly-issues-58-2.pdf", pageNumber: 1, scale: 2) Failed

I don't have time to add the tests to this project, but I wanted to at least fix the regression for now.

cc @rhuijben

@BobLd BobLd merged commit 3555521 into UglyToad:master Oct 23, 2025
2 checks passed
@BobLd BobLd deleted the fix-regression-stream-length branch October 23, 2025 18:08
@EliotJones
Member

Thanks. I don't have time to review much anymore, but I think it makes sense not to trust the declared /Length in most situations: people do things like putting the number of filters, the length of the uncompressed data, or some random number in there. It can improve the initial stream parse if it happens to be the right length, but unfortunately you usually need to manually read the stream in every case.

@rhuijben
Contributor

Thanks! I will look into this

@rhuijben
Contributor

Do you need to drop the length check on both levels?

@BobLd
Collaborator Author

BobLd commented Oct 29, 2025

@rhuijben that makes sense, yes. I'll push a PR for that.

@rhuijben
Contributor

@rhuijben that makes sense, yes. I'll push a PR for that.

I see you pushed a fix for the flate filter.

What I asked was if you see the problems at both levels?
You removed the length trimming in two places. And with that last commit in three.

I'm trying to find where it is safe to do the trimming. Just assuming that whitespace is part of the stream is not always the right thing to do... but just trimming isn't either, or the tests wouldn't have failed.

@BobLd
Collaborator Author

BobLd commented Oct 30, 2025

@rhuijben my bad - I misunderstood your earlier comment.

What I asked was if you see the problems at both levels?

I only saw the issue at the top level (in PdfExtensions.Decode(...)), not at the FlateFilter level. That being said, and in line with Eliot's message, I think we should not trust /Length, mainly for the reasons Eliot laid out. Another reason I removed the trimming in the FlateFilter is that it seems not to be the correct place to trim - see below.

You removed the length trimming in two places. And with that last commit in three.

Yes, I believe this is safer for now, in line with the above.

I'm trying to find where it is safe to do the trimming. Just assuming that whitespace is part of the stream is not always the right thing to do... but just trimming isn't either, or the tests wouldn't have failed.

Sadly, I don't think it's going to be easy. Going back to the PDF 2.0 specification:

7.3.8.2 Stream extent
Every stream dictionary shall have a Length entry that indicates how many bytes of the PDF file are used for the stream's data. (If the stream has a filter, Length shall be the number of bytes of encoded data.)

From this, my understanding is that the check should be done in the PdfExtensions.Decode(...) methods, before any decoding. A check inside the FlateFilter might run after another filter has already decoded the data, at which point /Length no longer describes the bytes being trimmed. I removed this trimming in this PR, as it was creating an issue.
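
To make the point about placement concrete, here is a minimal Python sketch (not PdfPig code). It uses Python's `zlib` as a stand-in for the Flate filter and a contrived but legal filter chain, `/Filter [/FlateDecode /FlateDecode]`. The function names `decode_trim_at_top` and `decode_trim_in_filter` are hypothetical and only illustrate the two placements of the /Length trim.

```python
import zlib

content = b"BT /F1 12 Tf (Hello) Tj ET\n" * 20

# Writer side for /Filter [/FlateDecode /FlateDecode] (contrived but legal):
# the inner pass stores the data uncompressed (level 0), the outer pass deflates it.
inner = zlib.compress(content, 0)
on_disk = zlib.compress(inner)
declared_length = len(on_disk)  # /Length counts the encoded bytes in the file

# Bytes a lenient tokenizer might hand over: the data plus the EOL before 'endstream'.
raw = on_disk + b"\r\n"


def decode_trim_at_top(raw: bytes, length: int) -> bytes:
    """Apply /Length once, to the raw bytes, before any filter runs."""
    data = raw[:length]
    for _ in range(2):
        data = zlib.decompress(data)
    return data


def decode_trim_in_filter(raw: bytes, length: int) -> bytes:
    """Apply /Length inside every Flate step (the problematic placement)."""
    data = raw
    for _ in range(2):
        # On the second pass the input is the intermediate data, which is
        # longer than /Length here, so this slice truncates it.
        data = zlib.decompress(data[:length])
    return data
```

In this setup the intermediate data is longer than the bytes on disk, so trimming inside the second Flate step cuts valid data and `zlib` raises an error, while trimming once at the top level decodes cleanly.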

One possible approach would be to try decoding the stream first without any trimming. If this decoding fails, we trim the stream to /Length and try decoding again... Another approach could be to check for whitespace. I'm more than open to discussing this, and my understanding of the specification might be wrong.
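
A minimal Python sketch of the first approach, under stated assumptions: `strict_inflate` and `decode_with_fallback` are hypothetical names, not PdfPig API, and `strict_inflate` is deliberately strict (it rejects trailing bytes) so that the fallback path has something to recover from.

```python
import zlib


def strict_inflate(data: bytes) -> bytes:
    """Inflate zlib data, rejecting truncated input and trailing bytes."""
    d = zlib.decompressobj()
    out = d.decompress(data) + d.flush()
    if not d.eof or d.unused_data:
        raise ValueError("truncated stream or unexpected trailing bytes")
    return out


def decode_with_fallback(raw, declared_length, decode=strict_inflate):
    """Decode the untrimmed bytes first; trim to /Length only if that fails."""
    try:
        return decode(raw)
    except (zlib.error, ValueError):
        # Only retry when /Length is usable and would actually shorten the data.
        if not isinstance(declared_length, int) or not 0 <= declared_length < len(raw):
            raise
        return decode(raw[:declared_length])


content = b"0 0 m 100 100 l S\n" * 10
packed = zlib.compress(content)

# Correct /Length, but the tokenizer also captured the EOL before 'endstream':
# the first attempt fails, the trimmed retry succeeds.
recovered = decode_with_fallback(packed + b"\r\n", len(packed))

# Wrong /Length on otherwise clean data: the first attempt succeeds,
# so the bad value is never used.
untouched = decode_with_fallback(packed, 5)
```

The trade-off is that a failing stream gets decoded twice, and a decoder that silently tolerates trailing bytes would never trigger the retry.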

As a side note, this is a prime example of PDF readers not fully enforcing the specification and being more lenient. This is also why PdfPig has a lenient mode.

Below are the documents that were failing before this PR:
AcroFormsBasicFields.pdf
FontMatrix-concat.pdf
caly-issues-58-2.pdf

@BobLd
Collaborator Author

BobLd commented Oct 30, 2025

I asked Perplexity AI about our discussion:

in the context of pdf specification, and using feedback you can find on the web, should the /length parameter of a stream be relied on?

The PDF specification requires that every stream dictionary include the /Length entry, which specifies the number of bytes for the stream's data (typically the encoded data if filters are applied). However, in practice and based on community and expert feedback, you should be cautious about relying unconditionally on /Length. [1][2]

Specification Requirements

  • The /Length parameter is mandatory in the stream dictionary according to the PDF 1.7 specification. [3][1]
  • This /Length can be a direct or indirect object. If it is indirect, parsers must resolve the reference to find the correct length value. [4]
  • The canonical way to read a stream is to use the /Length field to read that many bytes, then ensure that what follows is the literal keyword endstream. [2]

Practical Reliability Issues

  • In the real world, malformed or hand-crafted PDFs can have incorrect /Length entries. This is especially true in documents created by buggy generators or those trying to evade detection or analysis. [5]
  • If a parser relies solely on /Length, but /Length is wrong, the stream read will result in garbage, truncated data, or unexpected errors.
  • Community discussions and StackOverflow guidance recommend double-checking that the /Length value does not overrun the next endstream and endobj keywords, and to fail safely or recover gracefully if mismatches are found. [5]
  • Some PDF libraries perform sanity checks: if the /Length leads past endstream or causes parser errors, they will fall back to searching for endstream as a secondary safeguard. [5]

Best Practices

  • Always read via /Length first, as that's canonical per spec.
  • Implement secondary safety: if /Length is out-of-range or mismatches the endstream delimiter, treat as an error or attempt stream recovery to maintain robustness.
  • Never 100% trust /Length from untrusted or unpredictable sources without basic sanity checking. [5]

In summary: /Length should be used according to spec, but practically, do not blindly trust it without verifying the correct stream boundaries, especially when working with PDFs from varied or untrusted origins. [1][2][5]
[6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]
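
The read-then-verify strategy described in the answer above can be sketched in Python as follows. This is an illustration under assumptions, not how PdfPig or any particular library implements it: `read_stream_data` is a hypothetical helper, the whole file is assumed to be in memory as `buf`, and `start` is assumed to point at the first byte after the `stream` keyword and its EOL.

```python
def read_stream_data(buf: bytes, start: int, declared_length):
    """Return (data, length_was_trusted) for the stream data beginning at buf[start]."""
    if isinstance(declared_length, int) and declared_length >= 0:
        end = start + declared_length
        # After /Length bytes we expect an optional EOL, then 'endstream'.
        tail = buf[end:end + 2 + len(b"endstream")]
        if end <= len(buf) and tail.lstrip(b"\r\n").startswith(b"endstream"):
            return buf[start:end], True

    # Fallback: scan for the keyword and drop a single EOL in front of it.
    # Caveat: binary data that happens to contain b"endstream" would be
    # cut short by this naive search.
    marker = buf.find(b"endstream", start)
    if marker < 0:
        raise ValueError("no endstream keyword found")
    data = buf[start:marker]
    for eol in (b"\r\n", b"\n", b"\r"):
        if data.endswith(eol):
            data = data[:-len(eol)]
            break
    return data, False


body = b"q 1 0 0 1 0 0 cm Q"
buf = b"<< /Length 18 >>\nstream\n" + body + b"\nendstream\nendobj\n"
start = buf.index(b"stream\n") + len(b"stream\n")
```

With this sample, a correct length is accepted as is, while a wrong, oversized, or missing length falls back to the keyword search and still recovers `body`.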

Footnotes

  1. https://www.proteansec.com/exploit-development/pdf-file-format-basic-structure/

  2. https://blog.didierstevens.com/2008/05/19/pdf-stream-objects/

  3. https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf

  4. https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.0.pdf

  5. https://stackoverflow.com/questions/50325459/how-to-parse-a-binary-pdf-stream-of-unknown-length

  6. https://arxiv.org/pdf/2302.00169.pdf

  7. https://www.newoaks.ai/blog/understanding-gpt-4o-pdf-size-limit/

  8. https://www.ifrc.org/sites/default/files/IFRC_feedback-mechanism-with-communities_ok_web.pdf

  9. https://stackoverflow.com/questions/3712556/methods-of-parsing-large-pdf-files

  10. https://commandlinefanatic.com/cgi-bin/showarticle.cgi?article=art019

  11. https://cronfa.swan.ac.uk/Record/cronfa65159/Download/65159__29443__2f599ae6162346758e7d3d430730bc8c.pdf

  12. https://unstract.com/blog/pdf-hell-and-practical-rag-applications/

  13. https://en.wikipedia.org/wiki/PDF

  14. https://papers.ssrn.com/sol3/Delivery.cfm/d61731f9-a01c-4c43-9969-05576e4f9f41-MECA.pdf?abstractid=4853712&mirid=1

  15. https://arxiv.org/html/2410.09871v1

  16. https://helpx.adobe.com/acrobat/desktop/create-documents/explore-advanced-conversion-settings/pdf-settings-overview.html

  17. http://edshare.soton.ac.uk/1233/86/PDFformatutorial-HelloWorld.pdf

  18. https://pdf-issues.pdfa.org/32000-2-2020/clause07.html

  19. https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys21b.pdf

  20. https://developer.signalwire.com/rest/signalwire-rest/guides/datasphere/pdf-ingestion-best-practices/
