Add ABNF description of TOML. #236
Did you look into using PEG notation instead? http://pdos.csail.mit.edu/papers/parsing:popl04.pdf What are your thoughts on that?
RFC 4234 has been obsoleted by RFC 5234, so the grammar should reference the latter. Differences between the two RFCs seem to be minor and shouldn't otherwise affect the content of the grammar from what I can tell.
One simplification that comes to mind: RFC 4234/5234 allows writing printable ASCII characters between quotation marks. Somewhat counterintuitively, this implies case-insensitive matching for letters so case-sensitive rules such as For printable ASCII chars that aren't letters this isn't a problem, though, hence the grammar could write Case-insensitive matching could also be used for simplification in at least one case:
could be replaced with
without changing the meaning.
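To illustrate the case-sensitivity point above, here is a minimal Python sketch of how the two ABNF literal forms behave under RFC 5234: a quoted string matches ASCII letters case-insensitively, while a numeric value such as `%x74.72.75.65` matches exact code points. The helper names `match_quoted` and `match_hex` are hypothetical and are not part of the grammar under review.

```python
from typing import Sequence


def ascii_fold(s: str) -> str:
    """Lower-case only the ASCII letters A-Z, leaving every other character alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def match_quoted(literal: str, text: str) -> bool:
    """ABNF quoted string, e.g. "true": ASCII case-insensitive."""
    return ascii_fold(text) == ascii_fold(literal)


def match_hex(codes: Sequence[int], text: str) -> bool:
    """ABNF numeric value, e.g. %x74.72.75.65: exact code points only."""
    return [ord(c) for c in text] == list(codes)


TRUE_HEX = [0x74, 0x72, 0x75, 0x65]  # t r u e

print(match_quoted("true", "TRUE"))  # quoted form accepts any letter case
print(match_hex(TRUE_HEX, "TRUE"))   # hex form accepts only lower-case "true"
print(match_quoted("=", "="))        # non-letters are unaffected by case folding
```

For symbols such as `=` the two forms are equivalent, which is why quoting is only risky for letters.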
PEG notation would make it much more readable, I think. Compare, ABNF:
PEG:
Or even:
UPDATED for better example. |
One last point: it seems that the grammar is much more strict than the current textual description of TOML about what is allowed in a key. The README allows practically anything, as long as keys don't start or end with whitespace and don't contain any of IMHO this would be unfortunate as it would discriminate against non-English languages by restricting keys to basically English words, while non-English words such as "café" or "süß" would no longer be allowed as keys. If further restrictions are necessary and desired (I'm not really convinced), then at least "ALPHA / DIGIT" should be replaced with "any Unicode letter or digit". That would be similar to what XML and modern programming languages (JavaScript, Python etc.) allow as identifiers.
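A rough Python sketch of the difference between the two key rules discussed above, assuming "any Unicode letter or digit" means the Unicode general categories `L*` and `Nd`. That reading is an assumption made for illustration, and the function names are hypothetical.

```python
import string
import unicodedata

ASCII_ALNUM = set(string.ascii_letters + string.digits)


def is_ascii_key(key: str) -> bool:
    """Key made only of ABNF ALPHA / DIGIT characters."""
    return key != "" and all(c in ASCII_ALNUM for c in key)


def is_unicode_key(key: str) -> bool:
    """Key made only of Unicode letters (L*) or decimal digits (Nd)."""
    def ok(c: str) -> bool:
        cat = unicodedata.category(c)
        return cat.startswith("L") or cat == "Nd"
    return key != "" and all(ok(c) for c in key)


for key in ["cafe", "caf\u00e9", "s\u00fc\u00df", "a b"]:
    print(repr(key), is_ascii_key(key), is_unicode_key(key))
```

Note that a decomposed spelling of "café" (letter `e` followed by a combining accent) would fail this check, so a real rule would probably also need to say something about combining marks or normalization.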
@halst I'll give that paper a read. I like the idea of using a grammar specifically suited for recognizing a language (rather than producing a valid instance of a language).
@ChristianSi I'll update the document to reflect this. I don't see any material changes it made from 4234, so that's all good.
Yeah, this is the thing I like the least about ABNF, in regards to alpha characters. With regards to symbols, I decided to specify everything in hex to eliminate any possible ambiguity from misinterpreting which character might be in question. I actually like this kind of specificity, and don't mind it in combination with the comments that show what the corresponding ASCII character is.
True, and I intend to change this once we nail down what is allowed in a key. I mostly put it in there as a conservative placeholder for now. It seems you've read the grammar quite carefully; thanks! |
Is there a reason that |
time-second = 2DIGIT ; 00-58, 00-59, 00-60 based on leap second rules
time-secfrac = "." 1*DIGIT
time-numoffset = ( "+" / "-" ) time-hour ":" time-minute
time-offset = "Z" / time-numoffset
It should be clarified that `Z` can be uppercase or lowercase:
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively.
Same with `<exp>` and `<date-time>`.
Agreed.
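As a small illustration of the point about `Z`, here is a hedged Python sketch of the `time-offset` rule quoted above with both letter cases spelled out. It simplifies `time-hour` and `time-minute` to any two ASCII digits without range checks, and the name `is_time_offset` is hypothetical.

```python
import re

# time-numoffset = ( "+" / "-" ) time-hour ":" time-minute
# time-offset    = "Z" / time-numoffset   (with "z" also accepted)
TIME_OFFSET = re.compile(r"[Zz]|[+-][0-9]{2}:[0-9]{2}")


def is_time_offset(text: str) -> bool:
    """Return True if text is a complete time-offset under the simplified rule."""
    return TIME_OFFSET.fullmatch(text) is not None


for sample in ["Z", "z", "+02:00", "-07:30", "+2:00", "y"]:
    print(sample, is_time_offset(sample))
```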
Is there a reason why it should be allowed? In many places, prefixing a number with a Moreover, if TOML were ever to adopt octal numerals, this restriction would allow that to be a backwards compatible addition.
@BurntSushi out with the old
I agree with @dhardy.
(And that's a good enough reason to bar integers from starting with a
std-table-close = ws %x5D ; ] Right square bracket
table-key-sep = ws %x2E ws ; . Period

std-table = std-table-open key *( table-key-sep key) std-table-close
This rule will match `[]`, which is expressly forbidden in the spec.
Using `<key>` in this way does not follow the spec, which makes no mention of quoted table names.
My bad then, sorry. It seems like I read over the first `<key>` without noticing it.
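To make the resolution above concrete, here is a minimal Python sketch of the `std-table` rule showing that the mandatory first `key` rules out `[]`. Two things are assumptions for illustration only: `key` is reduced to a run of ASCII letters, digits, `_` and `-`, and `std-table-open` is taken to be `%x5B ws`, mirroring the `std-table-close` shown in the diff.

```python
import re

WS = r"[ \t]*"            # ws: optional spaces or tabs
KEY = r"[A-Za-z0-9_-]+"   # simplified stand-in for the grammar's key rule

# std-table = std-table-open key *( table-key-sep key ) std-table-close
STD_TABLE = re.compile(rf"\[{WS}{KEY}(?:{WS}\.{WS}{KEY})*{WS}\]")


def is_std_table(text: str) -> bool:
    """Return True if text is a complete table header under the simplified rule."""
    return STD_TABLE.fullmatch(text) is not None


for sample in ["[]", "[a]", "[ a . b ]", "[a.]"]:
    print(sample, is_std_table(sample))
```

Because the first `key` is not optional, an empty header can never match, regardless of how permissive the real `key` rule turns out to be.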
A few updates on this:
|
Currently this adds syntax for inline tables, so I just wanted to confirm: would you want to add that now, or merge that part of the ABNF at a possible future date? This is also super helpful @mojombo, thanks for doing this!
@alexcrichton I added that just to see how complex it would be. I'd merge that into the spec when Inline Tables go in.
array = array-open array-values array-close

array-values = [ val [ array-sep ] [ ( comment newlines) / newlines ] /
val array-sep [ ( comment newlines) / newlines ] array-values ]
This disallows newline before the first element, e.g.
# this is not allowed
key = [
1]
but allows it after:
# this is ok
key = [1
]
Is that intended?
Hopefully not. Could this be fixed? (It's surprising behavior)
It also disallows newlines between values and commas. Is it ok?
@mojombo Ping...
Any news on this?
RFC 4234 Section 3.5 advises the use of grouping notation rather than "bare" alternation when alternatives consist of multiple rules or literals, e.g. "( int float ) / ( bool char )" instead of "int float / bool char". I might've gotten the grouping wrong, which just serves to illustrate the importance of using grouping notation.
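The precedence issue behind this advice can be shown with Python regular expressions, which share the relevant property with ABNF: concatenation binds tighter than alternation. This is only an analogy using single-letter stand-ins for the rule names, not the grammar's actual rules.

```python
import re

# Without grouping, "ab|cd" reads as (ab) | (cd), just as the ABNF
# "int float / bool char" reads as ( int float ) / ( bool char ).
BARE = re.compile(r"ab|cd")

# Explicit grouping is needed to express "a, then b or c, then d".
GROUPED = re.compile(r"a(?:b|c)d")


def matches(pattern: "re.Pattern[str]", text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None


for sample in ["ab", "cd", "abd", "acd"]:
    print(sample, matches(BARE, sample), matches(GROUPED, sample))
```

The two patterns accept disjoint sets of strings, which is why leaving the grouping implicit invites misreading.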
The README.md has this at the end of the first TOML example:

```toml
hosts = [
  "alpha",
  "omega"
]
```

And later says:

> ...arrays also ignore newlines....

But the ABNF doesn't allow for a newline to precede the first element. I wasn't sure if the grammar should allow more than one newline preceding the first element, but given that an arbitrary number of whitespaces are allowed and the readme says that newlines are ignored in general, I figured allowing an arbitrary number of newlines best matches your intent. It also won't break anyone that happens to have multiple leading newlines in their arrays.
The array-values production already allows an arbitrary number of newlines before the closing bracket. It doesn't, however, allow any whitespace after those newlines unless it comes just before the closing bracket, so: ```toml [5, 6 \t\t] ``` is valid, but ```toml [5, 6 \t\t ] ``` is invalid. Probably not a big deal.
Back to the README example:

```toml
hosts = [
  "alpha",
  "omega"
]
```

The ABNF allows spaces after an element but before any newlines. The examples in the README, however, have spaces before elements after a newline, which isn't allowed in the grammar. I've added a new production "ws-newlines", which is just like "ws-newline" except that it requires at least one newline, like the production it replaces ("newlines"). With "ws-newlines" it becomes legal to put whitespace before an element after a newline.
Going over your hard example I found I couldn't parse this:

```toml
multi_line_array = [
    "]",
    # ] Oh yes I did
    ]
```

It turns out that the grammar doesn't allow newlines and whitespaces after the array-sep. I've modified the production again to accommodate newlines and whitespace after the array separator.
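The cases raised in these commit messages can be checked against a small recognizer. The Python sketch below is deliberately more permissive than the ABNF under review: it accepts whitespace, newlines and `#` comments anywhere between tokens (including between a value and a comma), and it only handles flat arrays of integers and simple double-quoted strings. All names in it are hypothetical.

```python
import re

# Filler allowed between tokens: spaces, tabs, newlines, and "#" comments
# that run to the end of the line.
FILLER = r"(?:[ \t\r\n]|#[^\n]*)*"

# Simplified values: a double-quoted string with backslash escapes, or an integer.
VALUE = r'(?:"(?:[^"\\\n]|\\.)*"|[+-]?[0-9]+)'

ARRAY = re.compile(
    rf"\[{FILLER}"
    rf"(?:{VALUE}{FILLER}(?:,{FILLER}{VALUE}{FILLER})*(?:,{FILLER})?)?"
    rf"\]"
)


def is_array(text: str) -> bool:
    """Return True if text is a complete flat array under the simplified rules."""
    return ARRAY.fullmatch(text) is not None


samples = [
    "[\n1]",                                        # newline before first element
    "[1\n]",                                        # newline after last element
    '[\n  "alpha",\n  "omega"\n]',                  # README example
    '[\n    "]",\n    # ] Oh yes I did\n    ]',     # "]" in a string and in a comment
    "[1 2]",                                        # missing separator
]
for s in samples:
    print(repr(s), is_array(s))
```

Treating all inter-token filler uniformly avoids the asymmetries reported in review, at the cost of being looser than a grammar that ties comments to line endings.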
;; Built-in ABNF terms, reproduced here for clarity

; ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
Why not use a Unicode `ALPHA`? That would allow characters like 'Æ', 'ø', 'å', 'ß' in keys, and would allow e.g. Arabic or Hebrew to be used in keys.
after the last element without a leading newline.
@mojombo Ping? |
@mojombo Ping 2.0? |
Added grouping to ambiguous alternatives; allow leading newlines inside arrays, spaces after newlines before array elements, and newlines and spaces before comments in arrays.
I'm pretty happy with the ABNF now, and I think it'll be super helpful for implementers. Time to merge! From now on, we should update the ABNF any time the TOML spec changes.
One of the most important things we'll need for the TOML 1.0 release is a proper grammar. I've worked one up here as a starting point. I chose RFC 4234 ABNF because it feels much more modern and suitable for a Unicode world than ISO EBNF. Also, JSON uses ABNF and so many people should already be familiar with it.
This grammar should be complete. It also includes scientific notation for floats, as I expect those will go in soon.
I'm certainly no ABNF expert, and I'd love any suggestions for making this grammar more readable.