This document describes strict rules for the Arlington PDF Model, covering both the data and the predicates (custom declarative predicates that start with `fn:`). Only some of these rules are currently implemented by the various PoCs, but everything is precisely documented here.
Note that the Arlington PDF Model accurately reflects the latest agreed ISO 32000-2:2020 PDF 2.0 specification (available at no cost), as amended by industry-agreed errata from https://pdf-issues.pdfa.org. If this state of affairs is unsuitable for adopters of the Arlington PDF Model (e.g. unresolved errata are causing issues for implementations) then the recommended practice is for those specific implementations to create private `diff` patches against the model, as it is entirely text-based.
- They are TSV, not CSV. Use tabs (`\t`).
- No double quotes are used.
- Every TSV file needs to have the same identical header row as the first line in the file.
- EOL rules for TSV are now set by `.gitattributes` to be LF
  - standard Linux CLI works, including under Windows WSL2: `cut`, `grep`, `sed`, etc.
  - this means you can also use all the eBay TSV utilities even under Windows
  - GNU datamash can also be used.
- Every TSV file needs to have the full set of TABs (for all columns).
- Last row in TSV needs EOL after last TAB.
- TSV file names are case-sensitive.
- TSV file extensions are always `.tsv` (lowercase) but are not present in the TSV data itself.
- All TSV files will have matching numbers of `[`, `]` and `(`, `)`.
- For a single row in any TSV, splitting each field on `;` will result in either 1 or N parts.
- Files that represent PDF arrays match either `ArrayOf*.tsv`, `*Array.tsv` or `*ColorSpace.tsv`
  - many are also identifiable by having a Key name of `0` (or `0*` or `*`): `grep "^0" *`
- Files that represent PDF 'map' objects (meaning that the dictionary key name can be anything) match `*Map.tsv`
  - note that CMaps are in `CMapStream.tsv`
- NOT all files that are PDF stream objects match `*Stream.tsv`
  - since each Arlington object is fully self-contained, many objects can be streams. The best method is to search for the `DecodeParms` key instead: `grep "^DecodeParms" * | tsv-pretty`
- There are NO leading SLASHES for PDF names (ever!)
- PDF names don't use `#`-escaping (currently unsupported)
- PDF strings use single quotes `'` and `'` (since `(` and `)` are ambiguous with expressions and single quotes are supported natively by the Python `csv` module)
- Expressions with integers need to use integers. Integers can be used in place of numbers.
- `*` represents a wildcard (i.e. anything). Other regex are not supported. Wildcards can be used in the Key field and in the PossibleValues field for names (when the PDF standard specifically states that other arbitrary names can be used)
- Leading `@` indicates "value of" a key or array element
- PDF Booleans are `true` and `false` (lowercase).
  - Uppercase `TRUE`/`FALSE` are reserved for logical Boolean TSV data fields such as the "Required" field.
- Expressions using `&&` or `||` logical operators need to be either fully bracketed or be just a predicate, and have a single SPACE either side of the logical operator. Precedence rules are NOT implemented.
- The predefined Arlington paths `parent::` and `trailer::` represent the parent of the current object and the file trailer (either traditional or a cross-reference stream) respectively. All other paths are relative to the containing PDF object.
- PDF arrays always use `[` and `]` (which may require some additional processing so as not to be confused with our `[];[];[]` syntax for complex fields)
  - elements in a PDF array do not use COMMA-separators and are specified just like in PDF e.g. `[0 1 0]`
  - if a PDF array needs to be specified as part of a complex typed key (`[];[];[]`) then 2 sets of `[` and `]` need to be used for the array values
    - e.g. `[[0 1]];[123];[SomeThing]` might be a Default Value for a PDF key that can be an array, an integer or a name (alphabetically sorted in the "Type" field!) each with a default value.
    - this extra pair of `[` and `]` is only needed for complex types.
- A key or array element is so-called "complex" if it can be multiple values. This is represented by `[];[];[]`-type expressions.
- Something is a so-called "wildcard" if the "Key" field contains an ASTERISK.
- An array is a so-called "repeating array" if it requires N x a set of elements. This is represented by DIGIT+ASTERISK in the "Key" field.
  - Repeating array elements with DIGIT+ASTERISK must be the last rows in a TSV
  - e.g. `0*`, `1*`, `2*` would be an array of 3 * N elements (N triplets of elements)
  - e.g. `0`, `1*`, `2*` would be an array of 2 * N + 1 elements, where the first element has a fixed definition, followed by repeating pairs of elements
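The file-level rules at the start of this list (identical header row, full set of TABs, LF line endings, no double quotes, EOL after the last row) lend themselves to a mechanical check. The following is a minimal Python sketch, not taken from any of the PoCs: the function name `tsv_problems` is hypothetical, and the expected header is passed in by the caller rather than assumed.

```python
def tsv_problems(raw: bytes, expected_header: list) -> list:
    """Return a list of rule violations for the raw bytes of one TSV file."""
    problems = []
    if b"\r" in raw:
        problems.append("EOL must be LF")
    if b'"' in raw:
        problems.append("double quotes are not used")
    if not raw.endswith(b"\n"):
        problems.append("last row needs EOL")

    lines = raw.decode("utf-8").split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final LF

    if not lines or lines[0].split("\t") != expected_header:
        problems.append("missing or incorrect header row")
    if len(lines) < 2:
        problems.append("needs header row plus at least one key")

    # Every data row must carry the full set of TABs (one fewer than columns)
    for number, line in enumerate(lines[1:], start=2):
        if line.count("\t") != len(expected_header) - 1:
            problems.append(f"line {number}: wrong number of TABs")
    return problems
```

In practice the caller would pass the header row shared by all TSV files in the data set, for example the first line of any one known-good file.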
- Must not be blank
- Case-sensitive (as per PDF spec)
- No duplicate keys in any single TSV file
- Only alphanumeric, `.`, `-`, `_` or ASTERISK characters (no whitespace or other special characters)
  - The proprietary Apple AAPL extensions also use `:` (COLON) as in `AAPL:ST`
- If a dictionary, then "Key" may also be an ASTERISK `*` meaning wildcard, so anything is allowed
- If ASTERISK `*` by itself then must be last row in TSV file
- If ASTERISK `*` by itself then "Required" column must be FALSE
- If representing a PDF array, then "Key" name is really an integer array index.
  - Zero-based increasing (always by 1) integers always starting at ZERO (0), with an optional ASTERISK appended after the digit (indicating repeat)
  - Or just an ASTERISK `*` meaning that any number of array elements may exist
- If representing a PDF array with a repeating set of array elements (such as alternating pairs of elements) then use `digit+ASTERISK` where the last set of rows must all be `digit+ASTERISK`. This indicates a repeating group of N elements starting at array element M: the array starts with a fixed (non-repeating) set of array elements 0 to M-1, followed by the repeating set of elements M to (M + N - 1).
- If representing a PDF array with `digit+ASTERISK` then the "Required" column should be TRUE if all N entries must always be repeated as a full set (e.g. in pairs or quads).
- Python pretty-print/JSON
  - String (as JSON dictionary key)
- Linux CLI tests:
  - list of all key names and array indices: `cut -f 1 * | sort -u`
  - files that define objects with an arbitrary number of keys or array elements use the wildcard `*`. If the line number of the wildcard is line 2 then it is a map-like object. If the line number is after 2, then there are additional fixed keys/elements: `grep --line-number "^\*" * | sed -e 's/\:/\t/g' | tsv-pretty`
  - files that define arrays with repeating sequences of N elements use the `digit+ASTERISK` syntax. Digit is currently restricted to a SINGLE digit 0-9: `grep "^[0-9]\*" * | tsv-pretty`
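To make the fixed/repeating/wildcard index rules concrete, here is a small Python sketch that classifies the "Key" column of an array-like TSV and tests whether a given array length is consistent with it. The function names are hypothetical. It deliberately ignores the "Required" column (every fixed row is treated as mandatory) and assumes that zero repetitions of a repeating group are acceptable; neither assumption is stated by the rules above.

```python
import re

def array_shape(keys):
    """Return (fixed, repeating, wildcard) counts for an array's Key column."""
    fixed = sum(1 for k in keys if re.fullmatch(r"[0-9]+", k))
    repeating = sum(1 for k in keys if re.fullmatch(r"[0-9]\*", k))
    wildcard = "*" in keys
    return fixed, repeating, wildcard

def length_is_consistent(keys, length):
    """True if an array of `length` elements fits the Key column layout."""
    fixed, repeating, wildcard = array_shape(keys)
    if length < fixed:
        return False              # assumption: all fixed rows are required
    extra = length - fixed
    if repeating:
        return extra % repeating == 0   # whole groups of N elements only
    if wildcard:
        return True               # any number of further elements
    return extra == 0
```

For the `0`, `1*`, `2*` example, lengths 1, 3, 5, ... are accepted, matching the 2 * N + 1 description.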
- Must not be blank
- Alphabetically sorted, SEMI-COLON separated list from the following predefined set of Arlington types (always lowercase):
array
bitmask
boolean
date
dictionary
integer
matrix
name
name-tree
null
number
number-tree
rectangle
stream
string
string-ascii
string-byte
string-text
- Each type may also be wrapped in a version-based predicate (e.g. `fn:SinceVersion(version,type)`, `fn:Deprecated(version,type)`, `fn:Extension(name,type)`, etc.).
- When a predicate is used, the internal simple type is still kept in its alphabetic sort order
- The following predefined Arlington types ALWAYS REQUIRE a link: `array`, `dictionary`, `stream`
- The following predefined Arlington types MAY have a link (this is because name and number trees can have nodes which are the primitive Arlington types below or a complex type above): `name-tree`, `number-tree`
  - e.g. `Navigator\Strings` is a name-tree of string objects
- The following predefined Arlington types NEVER have a link (they are the primitive Arlington types): `bitmask`, `boolean`, `date`, `integer`, `matrix`, `name`, `null`, `number`, `rectangle`, `string`, `string-ascii`, `string-byte`, `string-text`
- Note that `null` is only an explicit type when mentioned in ISO 32000-2:2020.
  - Dictionary handling is covered by subclause 7.3.7 "A dictionary entry whose value is null (see 7.3.9, "Null object") shall be treated the same as if the entry does not exist." so dictionaries will never have a `null` type unless ISO 32000-2 explicitly mentions it or there is a glitch in the matrix (e.g. Table 207 for Mac and Unix entries).
  - Array objects and name-tree and number-trees are more complex as ISO 32000-2:2020 makes no statements about `null`. See also Arlington Issue #90 and PDF 2.0 Errata #157.
- Python pretty-print/JSON:
  - Always a list
  - List elements are either:
    - Strings for the basic types listed above
    - Python lists for predicates - a simple search through the list for a match to the types above is sufficient (if understanding the predicate is not required)
  - Not to be confused with "/Type" keys, which is why the `[` is included in this grep: `grep "'Type': \[" dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests:
  - `cut -f 2 * | sort -u`
  - `cut -f 2 * | sed -e "s/;/\n/g" | sort -u`
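The rules for this field (predefined lowercase types, SEMI-COLON separated, alphabetically sorted even when wrapped in a version-based predicate) can be sketched in Python as follows. The helper names are hypothetical, and the sketch relies on the type always being the last argument of the wrapping predicate, as in the `fn:SinceVersion(version,type)` style signatures listed above. It uses Python's default string ordering as a stand-in for "alphabetical".

```python
import re

ARLINGTON_TYPES = {
    "array", "bitmask", "boolean", "date", "dictionary", "integer",
    "matrix", "name", "name-tree", "null", "number", "number-tree",
    "rectangle", "stream", "string", "string-ascii", "string-byte",
    "string-text",
}

def base_types(type_field):
    """Strip any predicate wrappers and return the plain type names."""
    types = []
    for part in type_field.split(";"):
        # The type is the last non-empty token once brackets/commas are removed
        tokens = [t for t in re.split(r"[(),]", part) if t]
        types.append(tokens[-1])
    return types

def type_field_is_valid(type_field):
    """Known types only, and in sorted order (field must not be blank)."""
    types = base_types(type_field)
    return all(t in ARLINGTON_TYPES for t in types) and types == sorted(types)
```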
- Must not be blank
- Must resolve to one of `1.0`, `1.1`, ... `1.7` or `2.0`
- Can be a predicate such as `fn:Extension(...)` or `fn:Eval(...)`
  - e.g. `fn:Extension(XYZ,2.0)` or `fn:Eval(fn:Extension(XYZ,1.3) || 1.6)`
- In the future the set of versions may be increased - e.g. `2.1`
- Version-based predicates in other fields should all be based on versions explicitly AFTER the version in this column
- Python pretty-print/JSON
  - Always a string (never blank!)
  - Value is one of the values listed above: `grep "'SinceVersion'" dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests: `cut -f 3 * | sort -u`
- Can be blank
- Must be one of `1.0`, `1.1`, ... `1.7` or `2.0`
- Version-based predicates in other fields should all be based on versions explicitly BEFORE the version in this column
- In the future:
  - Set of versions may be increased - e.g. `2.1`
- Python pretty-print/JSON
  - A string or `None`
  - Value is one of the values listed above: `grep "'Deprecated': " dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests: `cut -f 4 * | sort -u`
- Must not be blank
- Either:
  - Single word: `FALSE` or `TRUE` (uppercase only)
  - The predicate `fn:IsRequired(...)` - no SQUARE BRACKETS!
    - This may then have further nested predicates (e.g. `fn:SinceVersion`, `fn:IsPresent`, `fn:Not`)
- If "Key" column contains ASTERISK (as a wildcard), then "Required" field must be FALSE
  - Cannot require an infinite number of keys! If at least one element is needed, then have explicit first rows with "Required"==`TRUE` followed by ASTERISK with "Required"==`FALSE`
- Python pretty-print/JSON:
  - Always a list
  - List length is always 1
  - List element is either:
    - Boolean
    - Python list for predicates which must be `fn:IsRequired(`
  - `grep "'Required': " dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests: `cut -f 5 * | sort -u`
- Must not be blank
- Streams must always have "IndirectReference" as `TRUE`
- For name- and number-trees, the value represents the direct/indirect requirements of the values of the tree (e.g. if it is a stream, it would be `TRUE`)
- Either:
  - Single word: `FALSE` or `TRUE` (uppercase only, as it is not a PDF keyword!); or
  - Single predicate `fn:MustBeDirect()` or `fn:MustBeIndirect()` indicating that the corresponding key/array element must be a direct object or not; or
  - `[];[];[]` style expression - SEMI-COLON separated, SQUARE-BRACKETS expressions that exactly match the number of items in the "Type" column. Only the values `TRUE` or `FALSE` can be used inside each `[...]`; or
  - A more complex set of requirements using the predicate `fn:MustBeDirect(optional-key-path)` or `fn:MustBeIndirect(...)` NOT enclosed in SQUARE-BRACKETS
- Python pretty-print/JSON:
  - Always a list
  - List length always matches length of "Type" column
  - List elements are either:
    - Python Boolean (`True`/`False`)
    - Python list for predicates where the outer-most predicate must be `fn:MustBeDirect(` or `fn:MustBeIndirect(`, with an optional argument for a condition
  - `grep "'IndirectReference':" dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests: `cut -f 6 * | sort -u`
- Must not be blank
- Single word: `FALSE` or `TRUE` (uppercase only, as it is not a PDF keyword!)
- Python pretty-print/JSON:
  - Always a boolean: `grep "'Inheritable'" dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests: `cut -f 7 * | sort -u`
- Represents a default value for the PDF key/array element. As such it is always a single value for each Type.
  - see "PossibleValues" field below for when multiple values need to be specified.
- Can be blank
- SQUARE-BRACKETS are also used for PDF arrays, in which case they must use double SQUARE-BRACKETS if part of a complex type (note that lowercase `true`/`false` are the PDF keywords). If the array is the only valid type, then single SQUARE-BRACKETS are used. PDF array elements are NOT separated with COMMAs.
  - e.g. `[[false false]];[123]` vs `[false false]`
  - thus a complex expression can first be split by SEMI-COLON, then each portion has the SQUARE-BRACKETS stripped off - any remaining SQUARE-BRACKETS indicate an array.
- If there is a "DefaultValue" AND there are multiple types, then require a complex `[];[];[]` expression
  - If the "DefaultValue" is a PDF array as part of a complex type, then this will result in nested SQUARE-BRACKETS as in `[];[[0 0 1]];[]`
- The only valid predicates are:
  - `fn:ImplementationDependent()`, or
  - `fn:DefaultValue(condition,value)` where value must match the appropriate type (e.g. an integer for an integer key, a string for a string-* key, etc), or
  - `fn:Eval(expression)`
- Predicates only need a `[];[];[]` expression if a multi-typed key
- Python pretty-print/JSON:
  - A list or `None`
  - If list, then length always matches length of "Type"
  - If list element is also a list then it is either:
    - Predicate with 1st element being a FUNC_NAME token
    - "Key" value (`@key`) with 1st element being a KEY_VALUE token
    - A PDF array (1st token is anything else) - including an empty PDF array
  - `grep -o "'DefaultValue': .*" dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests:
  - `cut -f 8 * | sort -u`
  - `cut -f 2,8 * | sort -u | grep -P "\t[[:graph:]]+.*" | tsv-pretty`
  - `cut -f 1,2,8 * | sort -u | grep -P "\t[[:graph:]]*\t[[:graph:]]+.*$" | tsv-pretty`
- Can be blank
- SQUARE-BRACKETS are only required for complex types. A single type does not use them.
  - e.g. `12.34` is a valid default for a key which can only be a `number`
- SEMI-COLON separated, SQUARE-BRACKETS expressions that exactly match the number of items in "Type" column
- SQUARE-BRACKETS are also used for PDF arrays, in which case they must use double SQUARE-BRACKETS if part of a complex type. If the array is the only valid type, then single SQUARE-BRACKETS are used. PDF array elements are NOT separated with COMMAs - they are only used between arrays.
  - e.g. `[[0 1],[1 0]];[Value1,Value2,Value3]` is a choice of 2 arrays `[0 1]` and `[1 0]` if the type is an array, or a choice of `Value1` or `Value2` or `Value3` if the type was something else (e.g. name)
  - thus a complex expression can first be split by SEMI-COLON, then each portion has the SQUARE-BRACKETS stripped off, then multiple options can be split by COMMA as any remaining SQUARE-BRACKETS indicate an array.
- If there is a "PossibleValues" AND there are multiple types, then require a complex `[];[];[]` expression
  - If the "PossibleValues" is a PDF array as part of a complex type, then this will result in nested SQUARE-BRACKETS as in `[];[[0 0 1]];[]`
- For keys or arrays that are PDF names, a wildcard `*` indicates that any arbitrary name is explicitly permitted according to the PDF specification along with formally defined values (e.g. OptContentCreatorInfo, Subtype key: `[Artwork,Technical,*]`).
  - Do not use `*` as the only value, since an empty cell has the same meaning as "anything is OK", although there are some subtle nuances regarding whether custom keys have to be 2nd class names or can be really anything. See Errata #229
  - The wildcard must be the LAST entry in the list of names and, because it cannot be alone, it will always be preceded by a COMMA. This may occur in complex forms too such as `[...];[...,*];[...]`.
  - The TestGrammar PoC will not report an error about unexpected values in this case unless the `--explicit-values-only` CLI option is used.
  - To locate all such uses in the Arlington model, search for `,*]`: `grep ",\*]" *.tsv`
- The `fn:Eval` predicate wrapper is only needed for predicates which need to perform calculations. `fn:Eval` is not required around the version-based predicates (which includes `fn:Extension`) or expressions using `fn:RequiredValue`
- Python pretty-print/JSON:
  - A list or `None`
  - If list, then length always matches length of "Type"
    - Elements can be anything, including `None`
  - `grep -o "'PossibleValues': .*" dom.json | sed -e 's/^ *//' | sort -u`
- Linux CLI tests: `cut -f 9 * | sort -u`
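The split-strip-split procedure described above for complex `[];[];[]` expressions can be sketched in Python as follows. The function names are hypothetical. The sketch only covers complex (multi-type) expressions and assumes that PDF strings in the field do not themselves contain brackets, COMMAs or SEMI-COLONs.

```python
def split_top_level(text, separator):
    """Split on `separator` only where it sits outside all brackets."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch in "[(":
            depth += 1
        elif ch in "])":
            depth -= 1
        if ch == separator and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts

def possible_values(complex_field):
    """One list of options per type; options starting with '[' are PDF arrays."""
    result = []
    for part in split_top_level(complex_field, ";"):
        # Strip the outer SQUARE-BRACKETS of the complex expression
        inner = part[1:-1] if part.startswith("[") and part.endswith("]") else part
        result.append(split_top_level(inner, ",") if inner else [])
    return result
```

With the example from the list above, `possible_values("[[0 1],[1 0]];[Value1,Value2,Value3]")` gives two option lists, the first holding the arrays `[0 1]` and `[1 0]` and the second the three names.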
- Can be blank
- SEMI-COLON separated, SQUARE-BRACKETED complex expressions that exactly match the number of items in "Type" column
- Each expression inside a SQUARE-BRACKET is a predicate that reduces to TRUE/FALSE or is indeterminable.
  - TRUE means that it is valid, FALSE means it would be invalid.
- A SpecialCase predicate is not meant to reflect all rules from the PDF specification (things are declarative, not programmatic!)
- It should not test for required/optional-ness, whether an object is indirect or not, etc. as those rules should live in the other fields
- Python pretty-print/JSON:
  - A list or `None`
  - If list, then length always matches length of "Type"
    - Elements can be anything, including `None`
  - `grep -o "'SpecialCase': .*" dom.json | sed -e 's/^ *//' | sort -u`
- Can be blank (but only when "Type" is a single basic type)
- If non-blank, always uses SQUARE-BRACKETS
- SEMI-COLON separated, SQUARE-BRACKETED complex expressions that exactly match the number of items in "Type" column
- Valid "Links" must exist for these selected object types only:
  - `array`
  - `dictionary`
  - `stream`
  - `name-tree` - the value represents the node in the tree, not how trees are specified
  - `number-tree` - the value represents the node in the tree, not how trees are specified
- "Links" must NOT exist for selected fundamental "Types" (i.e. must be empty `[]` in the SEMI-COLON separated list): `bitmask`, `boolean`, `date`, `integer`, `matrix`, `name`, `null`, `number`, `rectangle`, `string`, `string-ascii`, `string-byte`, `string-text`
- Each sub-expression inside a SQUARE-BRACKET is a COMMA-separated list of case-sensitive filenames of other TSV files (without `.tsv` extension)
- These sub-expressions MUST BE one of these version-based predicates:
  - `fn:SinceVersion(pdf-version,link)`
  - `fn:SinceVersion(pdf-version,fn:Extension(name,link))`
  - `fn:IsPDFVersion(pdf-version,fn:Extension(name,link))`
  - `fn:Deprecated(pdf-version,link)`
  - `fn:BeforeVersion(pdf-version,link)`
  - `fn:IsPDFVersion(version,link)`
- Python pretty-print/JSON:
  - A list or `None`
  - If list, then length always matches length of "Type"
    - List elements can be `None`
    - Validity of list elements aligns with indexed "Type" data
- Linux CLI test:
  - a list of all predicates used in the Link field (column 11): `cut -f 11 * | sort -u | grep -o "fn:[a-zA-Z0-4]*" | sort -u`
- Can be blank
- Free text - no validation possible
- Often contains a reference to Table(s) (search for case-sensitive "Table ") or clause number(s) (search for case-sensitive "Clause ") from ISO 32000-2:2020 (PDF 2.0) or a PDF Association Errata issue link (as GitHub URL) where the Arlington machine-readable definition is defined.
  - For dictionaries, this is normally on the first key on the `Type` or `Subtype` row depending on what is the primary differentiating definition
  - Note that this is not where an object is referenced from, but where its key and values are defined. Sometimes this is within body text prose of ISO 32000-2:2020 (so outside a Table and a Clause reference is used) or as prose within the "Description" cell of some other key in another Table. Where an object is referenced is encoded by the Arlington PDF Model "Link" field - just grep for the case-sensitive TSV file (no extension)!
- The spreadsheet Arlington-vs-ISO32K-Tables.xlsx provides a cross reference from all mentions of "Table" within the Arlington PDF Model against an index of every Table in ISO 32000-2:2020 as published by ISO. Tables that are not mentioned anywhere in Arlington TSV files may indicate poor coverage in the Arlington PDF Model - or that the table is inappropriate for incorporating into the Arlington PDF Model.
  - Current known limitations include no support for FDF; less-than-perfect definition for Linearization objects; and no definition of content streams.
  - Note also that Arlington does additionally reference other ISO and Adobe publications, sometimes also with specific clause and Table references (such as for Adobe Extension Level 3).
- Python pretty-print/JSON:
  - A string or `None`
- Linux CLI voodoo:
  - find all TSV files in a data set that do not have either a Table number or Clause reference: `grep -PL "(Table )|(Clause )" *`
  - a list of most (but not all!) Table numbers referenced in an Arlington TSV file set (does not capture Annex tables): `grep --color=none -Pho "(?<=Table) [0-9]+" * | sort -un`
  - some PDF objects are defined by prose in clauses, rather than Tables: `grep -Pho "Clause [0-9A-H\.]*" * | sort -u`
  - find all ISO publications that are explicitly referenced: `grep -Pho "ISO[^_]*$" * | sort -u`
First and foremost, the predicate system is not based on functional programming!
The best way to understand an expression with a predicate is to read it out aloud, from left to right. Its verbalization should relatively closely match wording found in the PDF specification. Predicate simplification is avoided so that wording (when read aloud) is kept as close as possible to wording in the PDF specification.
- the internal Arlington grammar is loosely typed (so things need to match or be interpreted as matching the "Type" field (column 2)).
- integers may be used in place of numbers (but not vice-versa!)
- _parent::_ (all lowercase) is a special Arlington grammar keyword that forms the basis of a conceptual "relative" path in the PDF DOM. There can be multiple `parent::`s.
- _trailer::_ (all lowercase) is a special Arlington grammar keyword that forms the basis of a conceptual "absolute" path in the PDF DOM. Arlington always starts with the trailer, so that trailer keys and values can also be used in predicates.
- `trailer::Catalog` is a special Arlington alias for `trailer::Root`, as the Root key in the trailer is the reference to the Document Catalog. Normal PDF terminology refers to the "Document Catalog", so that commonly understood term is preferred over the ambiguous word "root" (which could mean either the trailer as the root or the Document Catalog as the root), and reading aloud "Catalog" sounds more natural.
  - Either `trailer::Catalog` or `trailer::Root` can be used, but the preference is `trailer::Catalog` because it verbalises better
- `null` (all lowercase) is the PDF null object (Note: it is also a valid predefined Arlington type).
  - `null` gets used in "DefaultValue" or "PossibleValues" fields only when it is explicitly mentioned in the PDF specification.
- `Key` means `key is present` (`Key` is a case-sensitive match and may include an Arlington path)
- `@Key` means `the value of key` (`Key` is a case-sensitive match and may include an Arlington path).
  - this also applies after a `path` - e.g. `keyA::keyB::@keyC` is valid and is the value of `keyC` when the path `keyA::keyB` is traversed
- Arlington paths are separated by `::` (double COLONs)
  - e.g. `parent::@Key`, `KeyA::KeyB`, `trailer::Catalog::Size`, `Object::<0-based integer>`
  - the `@` operator only applies to the right-most portion
- The `@` sign is always required for math and comparison operations, since those operate on values.
  - if an array or stream length is needed then use the specific predicate
- The predefined Arlington types used with `@` are the primitive types such as boolean, integer, number, string-*, name, etc.
- It is also possible to use the key name of an array for certain predicates such as `fn:Contains(...)`
- For complex types, if the "DefaultValue" for KeyA is `@KeyB` then it means that the "default value for Key A is the value of Key B" and so long as Keys A and B both have the same type then this is logical.
- `true` and `false` (all lowercase) are the PDF keywords (required for explicit comparison with `@key`) - uppercase `TRUE` and `FALSE` never get used in predicates as they represent Arlington model values such as for "Required", "IndirectReference" or "Inheritable" fields.
- All predicates start with `fn:` (case-sensitive, single COLON) followed by an uppercase character (`A`-`Z`)
- All predicate names are CamelCase case-sensitive with BRACKETS `(` and `)` and do NOT use DASH or UNDERSCOREs (i.e. must match a simple alphanumeric regex)
- Predicates can have 0, 1 or 2 arguments that are always COMMA separated
  - Predicates need to end with `()` for zero arguments
  - Arguments always within `(...)`
  - Predicates can nest (as arguments of other Predicates)
- Support two C/C++ style boolean operators: `&&` (logical and), `||` (logical or). There is also a special `fn:Not(...)` predicate.
- Support six C/C++ style comparison operators: `<`, `<=`, `>`, `>=`, `==`, `!=`
- NO bit-wise operators - use predicates instead
- NO unary NOT (`!`) operator (use predicate `fn:Not(...)`)
- All expressions MUST be fully bracketed between Boolean operators (to avoid defining precedence rules)
- NO conditional if/then, switch or loop style statements - it's purely declarative!
- NO local variables - it's purely declarative!
- Using comparison operators requires that the full expression is wrapped in `fn:Eval(...)`
# List all predicates by names:
grep --color=always -ho "fn:[[:alnum:]]*" * | sort -u
# List all predicates and their Arguments
grep -Pho "fn:[a-zA-Z0-9]+\((?:[^)(]+|(?R))*+\)" * | sort -u
# List all predicates that take no parameters:
grep --color=always -Pho "fn:[a-zA-Z0-9]+\(\)" * | sort -u
# List all parameter lists (but not predicate names) (and a few PDF strings too!):
grep --color=always -Pho "\((?>[^()]|(?R))*\)" * | sort -u
# List all predicates with their arguments:
grep --color=always -Pho "fn:[a-zA-Z0-9]+\([^\t\]\;]*\)" * | sort -u
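For tooling that already reads the TSV files from Python, the naming rules for predicates (`fn:` prefix, leading uppercase letter, alphanumeric only, always followed by an opening bracket) can be checked with the standard `re` module. This is an illustrative sketch with hypothetical function names, roughly equivalent to the first `grep` above plus a check for malformed names.

```python
import re

# fn: + uppercase letter + alphanumerics, immediately followed by "("
WELL_FORMED = re.compile(r"fn:[A-Z][A-Za-z0-9]*\(")

def predicate_names(expression):
    """Sorted unique names of all well-formed predicates in an expression."""
    return sorted({m[:-1] for m in WELL_FORMED.findall(expression)})

def malformed_predicates(expression):
    """Anything starting with fn: that breaks the naming rules."""
    candidates = re.findall(r"fn:[^(),;\[\]\s]*\(?", expression)
    return [c for c in candidates if not WELL_FORMED.fullmatch(c)]
```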
Any Linux command that outputs a row from an Arlington TSV data file can be piped through `tsv-pretty` to improve readability.
# Pretty columnized output:
tsv-pretty Catalog.tsv
# Find all keys that are of "Type" 'string-byte':
tsv-filter -H --str-eq Type:string-byte *.tsv
# Only precisely 'string-byte':
tsv-filter -H --str-eq Type:string-byte --ge SinceVersion:1.5 *.tsv
# Any string type (using string-based regex):
tsv-filter -H --regex Type:string\* --ge SinceVersion:1.5 *.tsv
# "Type" includes 'string-byte':
tsv-filter -H --regex Type:.\*string-byte\* --ge SinceVersion:1.5 *.tsv
# Find all annotations which have the ExData key
grep ^ExData Annot* | tsv-pretty
The term "reduction" is used to describe how predicates and their parameters get recursively processed from left to right. At any point a predicate or argument can be indeterminable, such as when a PDF does not have a key, or if the key is the wrong type, etc.
When thinking about predicates, it is important to remember that not all the parameters (arguments) to predicates will exist - thus only a portion of a predicate statement may be determinable when checking a PDF file. For example, a predicate of the form `fn:Eval(fn:SomeThing(@A,fn:Not(@B==b)))` is expecting that both the `/A` and `/B` keys will exist in the current object so that their values can be obtained, but this may not be required (and PDFs don't always follow requirements anyway!).
Note also that if both `/A` and `/B` are optional and both had a "DefaultValue" in TSV column 8 then this predicate would always be determinable. Further note that if a key is present but `null` then it is the same as not present!
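One way to picture reduction with indeterminable parts is three-valued logic, where Python's `None` stands for "indeterminable". The sketch below is only an illustration under that assumption: this document does not define how indeterminable sub-results combine, the Kleene-style rules in `tri_and` are a choice made here, and all function names are hypothetical.

```python
def value_of(pdf_object, key, defaults):
    """@key: the key's value, falling back to its DefaultValue if any.

    A key that is absent, or present but null (None), is treated the same.
    """
    value = pdf_object.get(key)
    if value is None:
        return defaults.get(key)  # still None when there is no default
    return value

def tri_eq(a, b):
    """Comparison is indeterminable if either side is."""
    return None if a is None or b is None else a == b

def tri_not(a):
    return None if a is None else not a

def tri_and(a, b):
    """Assumed Kleene-style AND: a definite False wins over indeterminable."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True
```

With an object that has `/A` but not `/B` and no default for `/B`, the `fn:Not(@B==b)` portion reduces to indeterminable; once `/B` has a default value, it reduces to a definite result.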
bit-posn
|
|
version
|
|
Do not use additional whitespace!
Single SPACE characters are only required around logical operators (` && ` and ` || `), MINUS (` - `, to disambiguate from a negative number) and the ` mod ` mathematical operator.
- `fn:AlwaysUnencrypted()`
- `fn:ArrayLength(key)`
- `fn:ArraySortAscending(key,integer)`
- `fn:BeforeVersion(version)`, `fn:BeforeVersion(version,statement)`
- `fn:BitClear(bit-posn)`
- `fn:BitSet(bit-posn)`
- `fn:BitsClear(low-bit,high-bit)`
- `fn:BitsSet(low-bit,high-bit)`
- `fn:Contains(@key,value)`
- `fn:DefaultValue(condition,value)`
- `fn:Deprecated(version,statement)`
- `fn:Eval(expr)`
- `fn:Extension(name)`, `fn:Extension(name,value)`
- `fn:FileSize()`
- `fn:FontHasLatinChars()`
- `fn:HasProcessColorants(array)`
- `fn:HasSpotColorants(array)`
- `fn:Ignore()`, `fn:Ignore(condition)`
- `fn:ImageIsStructContentItem()`
- `fn:ImplementationDependent()`
- `fn:InKeyMap(key)`
- `fn:InNameTree(key)`
- `fn:IsAssociatedFile()`
- `fn:IsEncryptedWrapper()`
- `fn:IsFieldName(value)`
- `fn:IsHexString()`
- `fn:IsLastInNumberFormatArray(key)`
- `fn:IsMeaningful(condition)`
- `fn:IsPDFTagged()`
- `fn:IsPDFVersion(version)`, `fn:IsPDFVersion(version,statement)`
- `fn:IsPresent(key or expr)`, `fn:IsPresent(key,condition)`
- `fn:IsRequired(condition)`
- `fn:KeyNameIsColorant()`
- `fn:MustBeDirect()`, `fn:MustBeDirect(condition)`
- `fn:MustBeIndirect()`, `fn:MustBeIndirect(condition)`
- `fn:NoCycle()`
- `fn:Not(expr)`
- `fn:NotStandard14Font()`
- `fn:NumberOfPages()`
- `fn:PageContainsStructContentItems()`
- `fn:PageProperty(page-ref,key)`
- `fn:RectHeight(key)`
- `fn:RectWidth(key)`
- `fn:RequiredValue(condition,value)`
- `fn:SinceVersion(version)`, `fn:SinceVersion(version,statement)`
- `fn:StreamLength(key)`
- `fn:StringLength(key)`
Please review and add any feedback or comments to the appropriate issue!
- `fn:ValueOnlyWhen(value,condition)`
- `fn:IsArray(key)`, `fn:IsDictionary(key)`, `fn:IsStream(key)`
- `fn:IsNameTreeValue(tree-reference,key)`, `fn:IsNameTreeIndex(tree-reference,@key)`, `fn:IsNumberTreeValue(tree-reference,key)`, `fn:IsNumberTreeIndex(tree-reference,@key)`
- `fn:IsInArray(@key,array)`
- `fn:AllowNull(key)`
The following are predicate grammar validation checks that should FAIL(!!) for each set of Arlington TSV files that represent a PDF version.
Some of these negative test cases can be done on a row-by-row basis, while others refer to a single TSV file, and a few may require checking other TSV files (such as when using `path::key`).
Implicit (semantic) knowledge of valid predicates and their arguments is also required:
- TSV with less than 2 rows (i.e. minimum TSV is header row + at least one key)
- missing or incorrect header row
- duplicate key names (or array indices) in same TSV
- wrong number of fields in TSV (it's fixed!)
- incorrect field ordering (it's fixed!)
- if TSV filename contains "Array" or "ColorSpace" then all rows except header must be integers 0-9 or integer 0-9 + ASTERISK
- excess use of SPACES in predicate expressions (e.g. around `!=` or `==`)
- invalid PDF version (only 1.0, 1.1, ..., 1.7 and 2.0)
- unknown predicates
- number of `;` in Type field does not match number of `;` in non-blank DefaultValue, PossibleValues, SpecialCase or Link fields
- unmatched/unbalanced `(`/`)` or `[`/`]` and `'`
- key = ASTERISK is not last row in TSV
- if 2nd row in a TSV is an integer or integer + ASTERISK and key integer of 2nd row is not 0 (i.e. all array indices start at 0)
- reference to a `key`, `@key`, `path::key` or `path::@key` that is not valid in a PDF version
- mixture of keys that are alphanumeric with integers (0-9) or integers (0-9) followed by ASTERISK
- the list of types in a complex Type field are not alphabetically sorted or separated by SEMI-COLON
- Link field has entries for simple types (incl. for complex types)
- Type field has linked types (incl. within complex types) but Link field is empty or just `[]`
- Link field not enclosed in `[`/`]`
- an unscoped key reference (`key` or `@key`) does not precisely match any key in current TSV
- a scoped key reference to `trailer::` does not match any key in FileTrailer.tsv (all PDF versions) or XRefStream.tsv (for appropriate PDF versions)
- a scoped key reference to `trailer::Catalog` does not match any key in Catalog.tsv
- type of data in DefaultValue and PossibleValues fields does not match appropriate Type field
- incorrect number of arguments for predicate
- wrong kind of argument for predicate
- for predicates that only work with specific types of PDF objects, the use of key or self-reference that cannot be that type (e.g. `fn:ArrayLength` is not referencing something that can be an array; `fn:StringLength` is not referencing something that can be a string; `fn:BitSet`, `fn:BitClear`, etc. only work with `bitmask`)
- mathematical operation on non-numeric data or predicate
- logical operation on non-boolean data or predicate
- reference to a key or key value that has a SinceVersion field that is later than the current key SinceVersion and is not protected with a version-based predicate
- a predicate with a condition argument that is always constant
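Two of the row-level checks above (balanced brackets and quotes, and matching SEMI-COLON counts between "Type" and the other complex fields) are simple enough to sketch in Python. The function names and the dictionary-per-row representation are assumptions made for illustration. The SEMI-COLON count is a plain character count, so it assumes PDF strings in these fields contain no SEMI-COLONs.

```python
def is_balanced(text):
    """True if ( ) and [ ] nest correctly and single quotes are paired."""
    closers = {")": "(", "]": "["}
    stack = []
    in_string = False
    for ch in text:
        if ch == "'":
            in_string = not in_string
        elif in_string:
            continue              # brackets inside PDF strings are ignored
        elif ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack and not in_string

def row_problems(row):
    """Check one TSV row given as a dict of field name to field text."""
    problems = []
    expected = row["Type"].count(";")
    for name in ("DefaultValue", "PossibleValues", "SpecialCase", "Link"):
        value = row.get(name, "")
        if not value:
            continue              # blank fields are exempt
        if not is_balanced(value):
            problems.append(f"{name}: unbalanced brackets or quotes")
        if value.count(";") != expected:
            problems.append(f"{name}: number of ';' does not match Type")
    return problems
```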
- Check array length requirements of all `array` search hits - done up to Table 95
- Check ranges of all `integer` search hits - done up to Table 170. See also: Errata #15
- Check explicit ranges of all `number` search hits - not started yet
- Check all arrays for handling of `null` elements - not started yet. See PDF 2.0 Errata #157