obj followed by endobj? #2

Open
faceless2 opened this issue Apr 30, 2021 · 9 comments
Comments

@faceless2

Peter - this is really great, by the way. The only assertion I have an issue with in this PDF is this token sequence:

2 0 obj endobj

I don't think this meets the requirements of the spec: here's the text (identical in both PDF1.4 and ISO32K2:2020)

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white-space), followed by the value of the object bracketed between the keywords obj and endobj.

The "value of the object" is missing, so I'm not sure that's a valid sequence of tokens.

@petervwyatt
Member

@faceless2 Thanks Mike!
You raise an interesting point.

I was focused on the statement under 7.3.9 Null object "An indirect object reference (see 7.3.10, "Indirect objects") to a nonexistent object shall be treated the same as a null object" so this was an attempt to make a form of "nonexistent object" while also having a fully valid xref, which I believe it is.

@faceless2
Author

Aha, that makes more sense. That's a different test to the one I thought you were aiming at. The non-existent object case we usually see is simply a reference to an object that doesn't exist, e.g. 9999 0 R, but obviously the end result is the same: no valid object found at that location. Thanks for clarifying, I'll close then.

@gettalong

@faceless2 Thanks Mike! You raise an interesting point.

I was focused on the statement under 7.3.9 Null object "An indirect object reference (see 7.3.10, "Indirect objects") to a nonexistent object shall be treated the same as a null object" so this was an attempt to make a form of "nonexistent object" while also having a fully valid xref, which I believe it is.

I'm with @faceless2 here: The object in question exists because it has an entry in the cross-reference table. However, it is invalid since its value is missing.

If you want an indirect object reference to a nonexistent object, there are two options:

  • Create an indirect object reference that is truly nonexistent, such as 100 0 R, as @faceless2 already mentioned.

  • Append an incremental update that marks one of the existing objects as free. All indirect object references to that object then become invalid and are treated as null.

At least two tools, HexaPDF and qpdf, mark the file as invalid due to this.
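
To make the two options concrete, here is a minimal Ruby sketch of how a reader might resolve references in both cases. The `resolve` helper and the hash-based xref table are hypothetical simplifications, not how HexaPDF or qpdf work internally, and generation numbers are ignored.

```ruby
# Hypothetical, simplified cross-reference table: object number => entry.
# An entry is either in use (and carries a value) or marked free.
def resolve(xref, object_number)
  entry = xref[object_number]
  # No entry at all, or an entry marked free: the reference acts as null.
  return nil if entry.nil? || entry[:state] == :free
  entry[:value]
end

xref = {
  1 => { state: :in_use, value: "catalog" },
  2 => { state: :in_use, value: 42 },
}

# Option 1: a reference such as 100 0 R has no xref entry at all.
p resolve(xref, 100)

# Option 2: an incremental update overrides the entry for object 2
# and marks it as free (assumption: later entries simply win).
update = { 2 => { state: :free, value: nil } }
xref = xref.merge(update)
p resolve(xref, 2)
p resolve(xref, 1)
```

Both lookups for the nonexistent and the freed object return nil, while object 1 still resolves to its value.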

@petervwyatt reopened this Feb 5, 2022
@petervwyatt
Member

I'm not so interested in tools as I can drive trucks through those tools and almost any other tool... :-)

What would be far better is to get to the point of a fully unambiguous definition of everything, one that can be reasoned about mathematically. So I'm re-opening because this thread is going in interesting directions... and I may want to copy this across to the pdf-issues GitHub repo eventually since I'm fairly sure we could improve the spec wording!

So if we conclude that a syntactically valid PDF object requires obj <value> endobj then how should the following token sequences be considered:

  • obj %comment endobj?
    7.2.4 states "PDF processors shall treat comments as single white-space characters for the purposes of lexical conversion" so this then semantically reduces to obj <whitespace> endobj and is that any different?

  • obj null endobj?
    7.3.9 states "Specifying the null object as the value of a dictionary entry (7.3.7, "Dictionary objects") shall be equivalent to omitting the entry entirely." But that is a statement about the entry (key/value pair), so this object does have a value even though the entry containing that value is treated as though it were omitted entirely. And if the indirect reference is from an array and not a dict, then the array element would clearly need to be kept as the null object so that later array elements are not shuffled up the array...

  • obj <some-garbage-byte> endobj?
    so the indirect object has a byte that would have to be considered the value of the object at some level...

Thoughts?

@MatthiasValvekens
Member

MatthiasValvekens commented Feb 5, 2022

fully unambiguous definition of everything, one that can be reasoned about mathematically.

I have some thoughts about that. Just to give things names: let's say that PDFObject is the name of the type that all PDF objects inhabit, in whatever abstract model we end up with. I'd argue that PDFObject has to be a sum type, where the sum runs over the (finite) number of object types that the standard enumerates: Dictionary, Number, Null, etc. Whichever way one formalises that, the result should (IMO) entail that the token sequence that goes between obj and endobj encodes a syntactically valid object of one of those constituent types.

On those grounds, I'd consider your second example a syntactically valid PDF object definition. On the other hand, I don't think the first and third examples make sense in that context: in my opinion, even assigning the Null type to an empty token sequence (after eliminating comments) or to a sequence of garbage bytes is wrong. Assigning a type to an empty token sequence would mess up lots of things, including dictionary and array parsing.
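
As a rough illustration of that view, the Ruby sketch below classifies the text found between obj and endobj. The `classify_body` helper is hypothetical and recognises only a few scalar types (null, booleans, integers). It is not a PDF parser, and it does not handle a % that appears inside a string.

```ruby
# Classify the raw text between `obj` and `endobj` (hypothetical helper).
def classify_body(body)
  # Comments run from % to the end of the line and count as white-space.
  text = body.gsub(/%[^\r\n]*/, " ").strip
  case text
  when ""              then :missing_value   # nothing left: no object at all
  when "null"          then :null
  when "true", "false" then :boolean
  when /\A[+-]?\d+\z/  then :integer
  else                      :not_an_object   # tokens, but no known type
  end
end

p classify_body(" ")           # => :missing_value
p classify_body("%comment\n")  # => :missing_value
p classify_body(" null ")      # => :null
p classify_body("\x01")        # => :not_an_object
```

Under this reading, only the null case yields a member of the sum type. The empty and garbage cases are rejected instead of being quietly mapped to Null.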

Just my 2¢.

@gettalong

So if we conclude that a syntactically valid PDF object requires obj <value> endobj then how should the following token sequences be considered:

  • obj %comment endobj?
    7.2.4 states "PDF processors shall treat comments as single white-space characters for the purposes of lexical conversion" so this then semantically reduces to obj <whitespace> endobj and is that any different?

A comment is not a valid PDF object. PDF objects are listed in section 7.3 of the spec, with 7.3.1 listing all eight basic object types. Comments are defined in section 7.2.3.

So your example would be equivalent to obj endobj and that is clearly missing a value.

  • obj null endobj?
    7.3.9 states "Specifying the null object as the value of a dictionary entry (7.3.7, "Dictionary objects") shall be equivalent to omitting the entry entirely." But that is a statement about the entry (key/value pair), so this object does have a value even though the entry containing that value is treated as though it were omitted entirely. And if the indirect reference is from an array and not a dict, then the array element would clearly need to be kept as the null object so that later array elements are not shuffled up the array...

The dictionary case is how such things are normally handled in programming languages. For example, in Ruby you have hashes, and if there is no value for a key, nil gets returned, which is the equivalent of the PDF null object:

h = {}
h[:key] = :value
p h[:key]    # => :value
p h[:unknown]   # => nil

So having a key with nil/null as value gets the same result as not having that key at all. There is a semantic difference that is sometimes used, e.g. when one wants to know whether a key was explicitly set to the nil/null value. But that is, I think, of no interest in the PDF case.

And yes, in the case of an array the null object would need to be kept in order to keep the indices correct. Again, this is just how such things work and this feature is also used in the PDF spec, see section "12.3.2.2 Explicit Destinations".
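
The same distinction can be sketched in Ruby. The `dest` array below is only a stand-in for an explicit destination of the form [page /XYZ left top zoom], where null entries hold the place of left and top; the symbols and values are illustrative, not taken from a real file.

```ruby
# A dictionary entry whose value is nil reads the same as a missing entry...
dict = { a: 1, b: nil }
p dict[:b]        # => nil
p dict[:missing]  # => nil

# ...although the two can still be told apart when that matters.
p dict.key?(:b)        # => true
p dict.key?(:missing)  # => false

# In an array, nil holds its slot so later elements keep their indices.
dest = [:page, :XYZ, nil, nil, 2.0]
p dest[4]          # => 2.0

# Dropping the nils shifts everything left, so index 4 no longer exists.
p dest.compact[4]  # => nil
```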

  • obj <some-garbage-byte> endobj?
    so the indirect object has a byte that would have to be considered the value of the object at some level...

Again, we would need to have one of the PDF objects listed in section 7.3 for this to be a valid indirect object.

The garbage you are referring to would be treated as a token, just like 'obj' is a token or << is a token. However, it is not a valid PDF object like (string).

@faceless2
Author

The difference between obj endobj, obj %comment endobj and obj null endobj is pretty academic, I expect, as whether valid or invalid they'll all have the same end result: a missing object in the xref. I'm aware that valid/invalid is a determination we need to make for PDF/A, but it has little practical effect outside that.

I'm not so interested in tools as I can drive trucks through those tools and almost any other tool... :-)

Fair call, and I agree. But more interesting than determining whether a sequence is valid, is determining how to parse it when it's not.

What do we do if length doesn't match the parsed length? At what point do we terminate parsing if there's trailing noise? Sure it's invalid, but standardising (or at least recommending) an approach for recovery is the big prize in my opinion. Although perhaps it's because it feels like I've spent much of the last 20 years in this gray area...

@gettalong

What do we do if length doesn't match the parsed length? At what point do we terminate parsing if there's trailing noise? Sure it's invalid, but standardising (or at least recommending) an approach for recovery is the big prize in my opinion. Although perhaps it's because it feels like I've spent much of the last 20 years in this gray area...

It's a funny coincidence that I find out about the SafeDocs project, the Arlington PDF Model and so on a few days after I started a project to document the ways PDF documents can be invalid and how to handle them - see https://github.com/gettalong/annotated-pdf-spec/

@petervwyatt
Member

@gettalong I would also strongly encourage you to join ISO TC 171 SC 2 if you can, either via your national body (if it is a participating member country) or via the PDF Association (as it has an official liaison with ISO). There are many other discussions that occur within ISO that are restricted by ISO in what can be said publicly. You may infer some activities from https://www.pdfa.org/iso-status/ but there are others. The PDF Association also has many Technical Working Groups working across various topics, and their output is often input into ISO. If you want to discuss more, feel free to DM me.


4 participants