Add ABNF description of TOML. #236

Merged
merged 23 commits into from
Jan 4, 2017
mojombo
Member

@mojombo mojombo commented Jul 18, 2014

One of the most important things we'll need for the TOML 1.0 release is a proper grammar. I've worked one up here as a starting point. I chose RFC 4234 ABNF because it feels much more modern and suitable for a Unicode world than ISO EBNF. Also, JSON uses ABNF and so many people should already be familiar with it.

This grammar should be complete. It also includes scientific notation for floats, as I expect those will go in soon.

I'm certainly no ABNF expert, and I'd love any suggestions for making this grammar more readable.

@mojombo mojombo mentioned this pull request Jul 18, 2014
@keleshev

Did you look into using PEG notation instead? http://pdos.csail.mit.edu/papers/parsing:popl04.pdf

What are your thoughts on that?

@ChristianSi
Contributor

RFC 4234 has been obsoleted by RFC 5234, so the grammar should reference the latter. Differences between the two RFCs seem to be minor and shouldn't otherwise affect the content of the grammar from what I can tell.

@ChristianSi
Contributor

One simplification that comes to mind: RFC 4|5234 allows writing printable ASCII characters between quotation marks. Somewhat counterintuitively, this implies case-insensitive matching for letters, so case-sensitive rules such as true = %x74.72.75.65 ; true must indeed be written in this form to avoid also matching TRUE, True, etc.

For printable ASCII chars that aren't letters this isn't a problem, though, hence the grammar could write " " instead of %x20, "#" instead of %x23, "=" instead of %x3D, "-" / "_" instead of %x2D / %x5F etc. Doing this wherever possible would IMHO significantly improve readability, as I don't find all those %x... combinations helpful.

Case-insensitive matching could also be used for simplification in at least one case:

exp = e [ minus / plus ] 1*DIGIT
e = %x65 / %x45 ; e E

could be replaced with

exp = "E" [ minus / plus ] 1*DIGIT

without changing the meaning.
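
To illustrate the point about quoted literals, here is a minimal Python sketch. It only mimics the two matching behaviours described above and is not an ABNF implementation; the helper names `match_quoted` and `match_hex` are invented for this example, and inputs are assumed to be ASCII.

```python
from typing import Sequence


def match_quoted(literal: str, text: str) -> bool:
    """Mimic an ABNF quoted string: letters compare case-insensitively."""
    return text.lower() == literal.lower()


def match_hex(codes: Sequence[int], text: str) -> bool:
    """Mimic an ABNF %x sequence such as %x74.72.75.65: exact code points only."""
    return [ord(ch) for ch in text] == list(codes)


TRUE_HEX = [0x74, 0x72, 0x75, 0x65]  # %x74.72.75.65 ; true

if __name__ == "__main__":
    for candidate in ("true", "TRUE", "True"):
        # The quoted form accepts every casing; the hex form accepts only "true".
        print(candidate, match_quoted("true", candidate), match_hex(TRUE_HEX, candidate))
```

This is also why a quoted "E" can stand in for the separate `e = %x65 / %x45` rule: the quoted form already accepts both cases.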

@keleshev

PEG notation would make it much more readable, I think.

Compare, ABNF:

boolean = true / false
true    = %x74.72.75.65     ; true
false   = %x66.61.6c.73.65  ; false
...
datetime = ymd tee hms zee
ymd = 4DIGIT dash 2DIGIT dash 2DIGIT
hms = 2DIGIT colon 2DIGIT colon 2DIGIT
dash = %x2D  ; -
colon = %x3A ; :
tee = %x54   ; T
zee = %x5A   ; Z

PEG:

boolean <- "true" / "false"
...
datetime <- ymd "T" hms "Z"
ymd <- 4DIGIT "-" 2DIGIT "-" 2DIGIT
hms <- 2DIGIT ":" 2DIGIT ":" 2DIGIT

Or even:

datetime <- dd dd "-" dd "-" dd "T" dd ":" dd ":" dd "Z"
dd <- digit digit

UPDATED for better example.
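
For readers who know neither notation, the datetime rule that both versions describe can be approximated as a fixed-shape pattern. This is a rough sketch using a Python regex, which is neither ABNF nor PEG; the names `DATETIME` and `is_datetime` are made up here, and the pattern checks shape only, not calendar validity.

```python
import re

# datetime = ymd "T" hms "Z"
# ymd = 4DIGIT "-" 2DIGIT "-" 2DIGIT
# hms = 2DIGIT ":" 2DIGIT ":" 2DIGIT
# Shape only: no range checks on month, day, hour, and so on.
DATETIME = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z")


def is_datetime(text: str) -> bool:
    """Return True if text has the exact shape YYYY-MM-DDTHH:MM:SSZ."""
    return DATETIME.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_datetime("1979-05-27T07:32:00Z"))
    print(is_datetime("1979-5-27T07:32:00Z"))
```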

@ChristianSi
Contributor

One last point: it seems that the grammar is much more strict than the current textual description of TOML about what is allowed in a key. The README allows practically anything, as long as keys don't start or end with whitespace and don't contain any of .[]#=. The grammar, however, just allows A-Z / a-z / 0-9 / - / _ for unquoted keys and asks for quoted strings in all other cases.

IMHO this would be unfortunate as it would discriminate against non-English languages by restricting keys to basically English words, while non-English words such as "café" or "süß" would no longer be allowed as keys. If further restrictions are necessary and desired (I'm not really convinced), then at least "ALPHA / DIGIT" should be replaced with "any Unicode letter or digit".

That would be similar to what XML and modern programming languages (Javascript, Python etc.) allow as identifiers.
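
The difference between the two key policies can be sketched in Python. Both checks are illustrative assumptions: `is_ascii_bare_key` mirrors the grammar's A-Z / a-z / 0-9 / - / _, while `is_unicode_bare_key` approximates "any Unicode letter or digit" with `str.isalnum`, which is only one possible reading and not a definition proposed in this thread.

```python
import re

# The grammar's conservative rule for unquoted keys.
ASCII_KEY = re.compile(r"[A-Za-z0-9_-]+")


def is_ascii_bare_key(key: str) -> bool:
    """Accept only A-Z, a-z, 0-9, '-' and '_' (at least one character)."""
    return ASCII_KEY.fullmatch(key) is not None


def is_unicode_bare_key(key: str) -> bool:
    """Accept any alphanumeric character per str.isalnum, plus '-' and '_'."""
    return len(key) > 0 and all(ch.isalnum() or ch in "-_" for ch in key)


if __name__ == "__main__":
    for key in ("server_1", "caf\u00e9", "s\u00fc\u00df", "a.b"):
        print(key, is_ascii_bare_key(key), is_unicode_bare_key(key))
```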

@mojombo
Member Author

mojombo commented Jul 18, 2014

PEG notation would make it much more readable, I think.

@halst I'll give that paper a read. I like the idea of using a grammar specifically suited for recognizing a language (rather than producing a valid instance of a language).

RFC 4234 has been obsoleted by RFC 5234

@ChristianSi I'll update the document to reflect this. I don't see any material changes it made from 4234, so that's all good.

One simplification that comes to mind: RFC 4|5234 allows writing printable ASCII characters between quotation marks.

Yeah, this is the thing I like the least about ABNF, with regard to alpha characters. For symbols, I decided to specify everything in hex to eliminate any possible ambiguity about which character is meant. I actually like this kind of specificity, and don't mind it in combination with the comments that show the corresponding ASCII character.

it seems that the grammar is much more strict than the current textual description of TOML about what is allowed in a key.

True, and I intend to change this once we nail down what is allowed in a key. I mostly put it in there as a conservative placeholder for now. It seems you've read the grammar quite carefully; thanks!

@flaviut

flaviut commented Jan 4, 2015

Is there a reason that 01 is invalid? It seems like an arbitrary restriction for which I cannot see any justification.

time-second = 2DIGIT ; 00-58, 00-59, 00-60 based on leap second rules
time-secfrac = "." 1*DIGIT
time-numoffset = ( "+" / "-" ) time-hour ":" time-minute
time-offset = "Z" / time-numoffset

It should be clarified that Z can be uppercase or lowercase:

NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively.

Same with <exp> and <date-time>.

Member

Agreed.

@BurntSushi
Member

Is there a reason that 01 is invalid? It seems like an arbitrary restriction for which I cannot see any justification.

Is there a reason why it should be allowed? In many places, prefixing a number with a 0 is an instruction that the number should be interpreted as octal. TOML doesn't support octal, so it seems surprising to allow it.

Moreover, if TOML were ever to adopt octal numerals, this restriction would allow that to be a backwards compatible addition.

@dhardy

dhardy commented Jan 4, 2015

@BurntSushi out with the old 0755 octal notation. If you need octal, use something less ambiguous like 0c755.

@flaviut

flaviut commented Jan 4, 2015

I agree with @dhardy. 0777 is terrible for beginners and confusing for everyone else.

@BurntSushi
Member

0777 is the prevailing convention.

@BurntSushi
Member

(And that's a good enough reason to bar integers from starting with a 0 IMO. And my backwards compatible argument is still viable, because it gives us the flexibility to choose the syntax.)
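
The restriction under discussion can be sketched as a pattern. This is an assumed simplification of the integer rule (an optional sign, then either a lone 0 or a nonzero digit followed by more digits), written as a Python regex rather than ABNF; the names `INTEGER` and `is_integer` are hypothetical.

```python
import re

# Optional sign, then "0" alone or a digit 1-9 followed by any digits.
# A leading zero on a multi-digit number therefore never matches.
INTEGER = re.compile(r"[+-]?(?:0|[1-9][0-9]*)")


def is_integer(text: str) -> bool:
    """Return True if text is an integer without a leading zero."""
    return INTEGER.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("0", "42", "-17", "01", "0755"):
        print(candidate, is_integer(candidate))
```

Because 01 and 0755 are rejected outright, a later spec revision could assign them a meaning without changing how any currently valid document is read.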

std-table-close = ws %x5D ; ] Right square bracket
table-key-sep = ws %x2E ws ; . Period

std-table = std-table-open key *( table-key-sep key) std-table-close

This rule will match [], which is expressly forbidden in the spec.

Using <key> in this way does not follow the spec, which makes no mention of quoted table names.

Member Author

@flaviut I don't see how it will match []. The <key> rule mandates that at least 1 character be present. Also, see #283 which clarifies key names to match the rules present here.

My bad then, sorry. It seems like I read over the first <key> without noticing it.
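
To make the point about [] concrete, here is a rough Python approximation of the std-table rule. It assumes bare keys only (quoted keys are left out), and the names `KEY`, `WS`, and `is_std_table` are invented for this sketch.

```python
import re

KEY = r"[A-Za-z0-9_-]+"  # simplified: bare keys only, at least one character
WS = r"[ \t]*"           # optional spaces and tabs

# std-table = std-table-open key *( table-key-sep key ) std-table-close
# The first key is mandatory, so an empty pair of brackets cannot match.
STD_TABLE = re.compile(rf"\[{WS}{KEY}(?:{WS}\.{WS}{KEY})*{WS}\]")


def is_std_table(text: str) -> bool:
    """Return True if text looks like a standard table header."""
    return STD_TABLE.fullmatch(text) is not None


if __name__ == "__main__":
    for header in ("[a]", "[a.b]", "[ a . b ]", "[]", "[a.]"):
        print(repr(header), is_std_table(header))
```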

@mojombo
Member Author

mojombo commented Jan 16, 2015

A few updates on this:

  1. I've spent some time writing a PEG definition of TOML to see if that would be better to use than ABNF. I like the ideas behind PEGs a lot, but I don't think it's a suitable grammar format. There does not seem to be any official RFC-like standard for PEGs and the Bryan Ford paper is woefully lacking in details like how to specify ranges of Unicode characters, specific numbers of character repetition, etc. I haven't seen a single real-world example of it being used as an official definition, anywhere, ever. So I'm going to stick with ABNF.
  2. I'd like to merge the various parts of the ABNF into the full spec soon, so if anyone has additional changes to propose before that happens, now is the time!

@alexcrichton
Contributor

Currently this adds syntax for inline tables, so I just wanted to confirm, would you want to add that now or merge that part of the ABNF at a possible future date?

This is also super helpful @mojombo, thanks for doing this!

@mojombo
Member Author

mojombo commented Jan 16, 2015

@alexcrichton I added that just to see how complex it would be. I'd merge that into the spec when Inline Tables go in.

array = array-open array-values array-close

array-values = [ val [ array-sep ] [ ( comment newlines) / newlines ] /
val array-sep [ ( comment newlines) / newlines ] array-values ]

This disallows newline before the first element, e.g.

# this is not allowed
key = [
    1]

but allows it after:

# this is ok
key = [1
    ]

Is that intended?

Hopefully not. Could this be fixed? (It's surprising behavior)

Contributor

It also disallows newlines between values and commas. Is it ok?

@mojombo Ping...

Any news on this?

@ghost ghost mentioned this pull request Jun 22, 2015
@avakar avakar mentioned this pull request Jun 24, 2015
andrusha97 added a commit to andrusha97/loltoml that referenced this pull request Oct 17, 2015
RFC 4234 Section 3.5 advises the use of grouping notation rather than "bare" alternation when alternatives consist of multiple rules or literals, e.g. "( int float ) / ( bool char )" instead of "int float / bool char".

I might've gotten the grouping wrong, which just serves to illustrate the importance of using grouping notation.
The README.md has this at the end of the first TOML example:
```toml
hosts = [
  "alpha",
  "omega"
]
```
And later says:
>...arrays also ignore newlines....
But the ABNF doesn't allow for a newline to precede the first element.

I wasn't sure if the grammar should allow more than one newline preceding the first element, but given that an arbitrary number of whitespaces are allowed and the readme says that newlines are ignored in general I figured allowing an arbitrary number of newlines best matches your intent. It also won't break anyone that happens to have multiple leading newlines in their arrays.
The array-values production already allows an arbitrary number of newlines before the closing bracket. However, whitespace after a newline is allowed only when it comes immediately before the closing bracket, so:
```toml
[5, 6
\t\t]
```
is valid, but
```toml
[5, 6
\t\t
]
```
is invalid. Probably not a big deal.
Back to the README example:
```toml
hosts = [
  "alpha",
  "omega"
]
```
The ABNF allows spaces after an element but before any newlines. The examples in the README, however, have spaces before elements after a newline, which the grammar doesn't allow. I've added a new production "ws-newlines", which is just like "ws-newline" except that it requires at least one newline, like the production it replaces ("newlines"). With "ws-newlines" it is legal to put whitespace before an element after a newline.
Going over your hard example I found I couldn't parse this:
```toml
        multi_line_array = [
            "]",
            # ] Oh yes I did
            ]
```
It turns out that the grammar doesn't allow newlines and whitespaces after the array-sep. I've modified the production again to accommodate newlines and whitespace after the array separator.
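
The behaviour these commit messages aim for (whitespace, newlines, and comments allowed around every element and separator) can be sketched as a small recognizer. This is not a transcription of the PR's productions: values are limited to integers and escape-free strings, a trailing separator is accepted, and the names `FILLER`, `VALUE`, and `is_array` are hypothetical.

```python
import re

# Filler allowed between array tokens: spaces, tabs, newlines, and # comments.
FILLER = re.compile(r"(?:[ \t\r\n]+|#[^\n]*\n)*")
# Simplified values (an assumption): integers and strings without escapes.
VALUE = re.compile(r'[+-]?(?:0|[1-9][0-9]*)|"[^"\n]*"')


def is_array(text: str) -> bool:
    """Recognize '[' values separated by ',' ']' with filler between tokens."""
    if not text.startswith("["):
        return False
    pos = FILLER.match(text, 1).end()
    expect_value = True
    while pos < len(text) and text[pos] != "]":
        if expect_value:
            found = VALUE.match(text, pos)
            if found is None:
                return False
            pos = found.end()
        else:
            if text[pos] != ",":
                return False
            pos += 1
        expect_value = not expect_value
        pos = FILLER.match(text, pos).end()
    # The closing bracket must be the final character of the input.
    return pos == len(text) - 1


if __name__ == "__main__":
    hard = '[\n    "]",\n    # ] Oh yes I did\n    ]'
    print(is_array(hard))
    print(is_array("[5, 6\n\t\t\n]"))
```

Treating all filler uniformly avoids the asymmetries described above, at the cost of being more permissive than a line-oriented grammar.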

;; Built-in ABNF terms, reproduced here for clarity

; ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

Why not use Unicode ALPHA? That would allow characters like 'Æ', 'ø', 'å', 'ß' in keywords, and allow e.g. Arabic or Hebrew to be used in keywords.

after the last element without a leading newline.
@dead-claudia

@mojombo Ping?

@pradyunsg
Member

@mojombo Ping 2.0?

@mojombo
Member Author

mojombo commented Jan 4, 2017

I'm pretty happy with the ABNF now, and I think it'll be super helpful for implementers. Time to merge! From now on, we should update the ABNF any time the TOML spec changes.
