Library-internal representation of floats #1594

MartinThoma · 2023-02-01T20:12:09Z

MartinThoma
Feb 1, 2023
Maintainer

PDF documents contain a lot of real numbers. So far, PyPDF has parsed them as Decimal, but we are thinking about switching to float (an IEEE 754 double).
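To make the trade-off concrete, here is a minimal sketch of the two options. The helper names `parse_real_as_decimal` and `parse_real_as_float` are hypothetical and are not PyPDF's actual API; they only illustrate how the same token behaves under each representation.

```python
from decimal import Decimal


def parse_real_as_decimal(token: bytes) -> Decimal:
    """Hypothetical helper: keep the decimal digits exactly as written in the file."""
    return Decimal(token.decode("ascii"))


def parse_real_as_float(token: bytes) -> float:
    """Hypothetical helper: convert to the nearest IEEE 754 double."""
    return float(token.decode("ascii"))


token = b"0.1"
as_decimal = parse_real_as_decimal(token)
as_float = parse_real_as_float(token)

# Decimal arithmetic stays exact for decimal fractions; binary doubles do not.
decimal_is_exact = as_decimal * 3 == Decimal("0.3")  # True
float_is_exact = as_float * 3 == 0.3                 # False
```

The float version is typically faster and interoperates directly with `math` and most numeric libraries, at the cost of small representation errors like the one above.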

How do other libraries do this?

pubpub-zz · 2023-02-01T20:37:07Z

pubpub-zz
Feb 1, 2023
Maintainer

@MartinThoma
What do you think of this code?
https://github.com/veraPDF/veraPDF-parser/blob/1fd115e97b7c3313f5f8b5083868ef61ce555dbd/src/main/java/org/verapdf/parser/BaseParser.java#L589

0 replies

pubpub-zz · 2023-02-01T20:44:21Z

pubpub-zz
Feb 1, 2023
Maintainer

@MartinThoma
Also:
https://www.oreilly.com/library/view/developing-with-pdf/9781449327903/ch01.html

0 replies

petervwyatt · 2023-02-02T00:13:09Z

petervwyatt
Feb 2, 2023

I am the co-Project Leader of ISO 32000 (the core PDF spec) and CTO at the PDF Association.

What used to be Annex C "Implementation Limits" in earlier editions of the PDF spec was removed a number of years ago because it was not vendor-neutral and only reflected the implementation choice of a single vendor on a single platform and at a point in time about 15 years ago. Obviously hardware and software have both changed in that period, and different implementations have different requirements.

Using doubles is common practice in PDF parsers. However, doubles have the usual problems of accumulated error, precision, and accuracy, so you must also take care that calculations remain within an acceptable range. If you are not rendering, there is certainly less to worry about, but there are PDFs out there that do really silly things, such as explicitly rescaling very large content down to a unit square (or smaller), scaling it back up again for no reason, and still expecting perfect alignment of objects.
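A small sketch of the accumulated-error problem, assuming a hypothetical `accumulate` helper that applies the same offset repeatedly (as a chain of translations would). It is only an illustration of the effect, not code from any PDF library.

```python
import math
from decimal import Decimal


def accumulate(step, count, zero):
    """Hypothetical helper: add `step` to a running total `count` times."""
    total = zero
    for _ in range(count):
        total = total + step
    return total


float_total = accumulate(0.1, 10, 0.0)
decimal_total = accumulate(Decimal("0.1"), 10, Decimal("0"))

# With doubles, ten steps of 0.1 do not land exactly on 1.0.
exactly_aligned = float_total == 1.0
# Comparing with a tolerance is the usual way to cope with this.
aligned_within_tolerance = math.isclose(float_total, 1.0, abs_tol=1e-9)
```

Here `exactly_aligned` is `False` while `aligned_within_tolerance` is `True`, and `decimal_total` equals `Decimal("1.0")` exactly. The tolerance of `1e-9` is an arbitrary choice for the example; a real implementation would pick one that suits its coordinate range.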

Note also that PDF now has several features where the expectation of 32-bit-only integer values is incorrect or unreasonable: new crypto, geospatial features, measurement properties, and movie activation, to name a few. Depending on what kind of software you are implementing, these may also influence your design choices.
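One way this can affect a Python parser, as a sketch: Python's `int` has arbitrary precision, so large integers are safe as long as they are not routed through `float`. The function `parse_number` below is hypothetical and only shows that integer tokens are best kept as `int`.

```python
INT32_MAX = 2**31 - 1


def parse_number(token: bytes):
    """Hypothetical helper: return int for integer tokens, float for reals."""
    text = token.decode("ascii")
    if "." in text:
        return float(text)
    return int(text)


big_token = b"9007199254740993"  # 2**53 + 1, far beyond the 32-bit range

as_int = parse_number(big_token)
via_float = int(float(big_token.decode("ascii")))

exceeds_int32 = as_int > INT32_MAX              # True
int_round_trips = as_int == 2**53 + 1           # True
float_round_trips = via_float == 2**53 + 1      # False: doubles have a 53-bit significand
```

A double represents every integer up to 2**53 exactly, so the 32-bit range itself is not at risk; the loss only appears for much larger values like the one used here.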

0 replies

MartinThoma · 2023-02-02T17:54:39Z

MartinThoma
Feb 2, 2023
Maintainer Author

PyMuPDF: https://discord.com/channels/770681584617652264/983871937711341618/1070764033529086103

2 replies

pubpub-zz Feb 2, 2023
Maintainer

I'm sorry, I cannot read the link (no lounge). Can you add a snapshot of their answer?

MartinThoma Feb 2, 2023
Maintainer Author

I think PyMuPDF works pretty well. That, together with the comment by petervwyatt, gives me confidence that this choice is OK.
