Add ABNF description of TOML. #236

mojombo · 2014-07-18T00:14:42Z

One of the most important things we'll need for the TOML 1.0 release is a proper grammar. I've worked one up here as a starting point. I chose RFC 4234 ABNF because it feels much more modern and suitable for a Unicode world than ISO EBNF. Also, JSON uses ABNF and so many people should already be familiar with it.

This grammar should be complete. It also includes scientific notation for floats, as I expect those will go in soon.

I'm certainly no ABNF expert, and I'd love any suggestions for making this grammar more readable.

keleshev · 2014-07-18T07:58:37Z

Did you look into using PEG notation instead? http://pdos.csail.mit.edu/papers/parsing:popl04.pdf

What are your thoughts on that?

ChristianSi · 2014-07-18T12:27:27Z

RFC 4234 has been obsoleted by RFC 5234, so the grammar should reference the latter. Differences between the two RFCs seem to be minor and shouldn't otherwise affect the content of the grammar from what I can tell.

ChristianSi · 2014-07-18T12:42:05Z

One simplification that comes to mind: RFC 4|5234 allows writing printable ASCII characters between quotation marks. Somewhat counterintuitively, this implies case-insensitive matching for letters, so case-sensitive rules such as true = %x74.72.75.65 ; true must indeed be written in this form to avoid also matching TRUE, True, etc.

For printable ASCII chars that aren't letters this isn't a problem, though, hence the grammar could write " " instead of %x20, "#" instead of %x23, "=" instead of %x3D, "-" / "_" instead of %x2D / %x5F etc. Doing this wherever possible would IMHO significantly improve readability, as I don't find all those %x... combinations helpful.

Case-insensitive matching could also be used for simplification in at least one case:

exp = e [ minus / plus ] 1*DIGIT
e = %x65 / %x45 ; e E

could be replaced with

exp = "E" [ minus / plus ] 1*DIGIT

without changing the meaning.
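The case-sensitivity point above can be illustrated with a minimal Python sketch. The helpers `matches_quoted` and `matches_hex` are hypothetical names invented for this illustration, not part of any ABNF tool, and they only model ASCII input: a quoted ABNF string compares letters case-insensitively, while a `%x..` sequence must match character for character.

```python
from typing import Sequence


def matches_quoted(literal: str, text: str) -> bool:
    """ABNF quoted string: ASCII letters match regardless of case."""
    return text.lower() == literal.lower()


def matches_hex(code_points: Sequence[int], text: str) -> bool:
    """ABNF %x.. concatenation: each character must match exactly."""
    return [ord(ch) for ch in text] == list(code_points)


TRUE_HEX = [0x74, 0x72, 0x75, 0x65]  # %x74.72.75.65, i.e. "true"

# The quoted form would also admit TRUE and True; the hex form would not.
print(matches_quoted("true", "TRUE"))   # True
print(matches_hex(TRUE_HEX, "TRUE"))    # False
print(matches_hex(TRUE_HEX, "true"))    # True

# For exp, a quoted "E" already covers both %x65 and %x45.
print(matches_quoted("E", "e"))         # True
```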

keleshev · 2014-07-18T12:48:41Z

PEG notation would make it much more readable, I think.

Compare, ABNF:

boolean = true / false
true    = %x74.72.75.65     ; true
false   = %x66.61.6c.73.65  ; false
...
datetime = ymd tee hms zee
ymd = 4DIGIT dash 2DIGIT dash 2DIGIT
hms = 2DIGIT colon 2DIGIT colon 2DIGIT
dash = %x2D  ; -
colon = %x3A ; :
tee = %x54   ; T
zee = %x5A   ; Z

PEG:

boolean <- "true" / "false"
...
datetime <- ymd "T" hms "Z"
ymd <- 4DIGIT "-" 2DIGIT "-" 2DIGIT
hms <- 2DIGIT ":" 2DIGIT ":" 2DIGIT

Or even:

datetime <- dd dd "-" dd "-" dd "T" dd ":" dd ":" dd "Z"
dd <- digit digit

UPDATED for better example.

ChristianSi · 2014-07-18T12:56:47Z

One last point: it seems that the grammar is much stricter than the current textual description of TOML about what is allowed in a key. The README allows practically anything, as long as keys don't start or end with whitespace and don't contain any of .[]#=. The grammar, however, just allows A-Z / a-z / 0-9 / - / _ for unquoted keys and asks for quoted strings in all other cases.

IMHO this would be unfortunate, as it would discriminate against non-English languages by restricting keys to basically English words, while non-English words such as "café" or "süß" would no longer be allowed as keys. If further restrictions are necessary and desired (I'm not really convinced), then at least "ALPHA / DIGIT" should be replaced with "any Unicode letter or digit".

That would be similar to what XML and modern programming languages (Javascript, Python etc.) allow as identifiers.
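To make the difference concrete, here is a rough Python sketch comparing the ASCII-only rule with a relaxed one. The names `ASCII_KEY` and `UNICODE_KEY` are hypothetical, and using `\w` as a stand-in for "any Unicode letter or digit" is an assumption of this sketch (it also admits the underscore), not something the grammar specifies.

```python
import re

# Unquoted keys as the grammar in this PR restricts them: A-Z / a-z / 0-9 / - / _
ASCII_KEY = re.compile(r"[A-Za-z0-9_-]+")

# Hypothetical relaxation: \w in a str pattern covers Unicode letters and digits
# (plus underscore); the hyphen is added to mirror the ASCII rule.
UNICODE_KEY = re.compile(r"[\w-]+")

# "caf\u00e9" is "café" and "s\u00fc\u00df" is "süß", written with escapes.
for key in ["server", "caf\u00e9", "s\u00fc\u00df", "a.b"]:
    print(key, bool(ASCII_KEY.fullmatch(key)), bool(UNICODE_KEY.fullmatch(key)))
```

Both patterns still reject `a.b`, since the period stays reserved as the key separator.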

mojombo · 2014-07-18T20:59:43Z

> PEG notation would make it much more readable, I think.

@halst I'll give that paper a read. I like the idea of using a grammar specifically suited for recognizing a language (rather than producing a valid instance of a language).

> RFC 4234 has been obsoleted by RFC 5234

@ChristianSi I'll update the document to reflect this. I don't see any material changes it made from 4234, so that's all good.

> One simplification that comes to mind: RFC 4|5234 allows writing printable ASCII characters between quotation marks.

Yeah, this is the thing I like the least about ABNF, in regards to alpha characters. With regards to symbols, I decided to specify everything in hex to eliminate any possible ambiguity from misinterpreting which character might be in question. I actually like this kind of specificity, and don't mind it in combination with the comments that show what the corresponding ASCII character is.

> it seems that the grammar is much stricter than the current textual description of TOML about what is allowed in a key.

True, and I intend to change this once we nail down what is allowed in a key. I mostly put it in there as a conservative placeholder for now. It seems you've read the grammar quite carefully; thanks!

flaviut · 2015-01-04T01:58:21Z

Is there a reason that 01 is invalid? It seems like an arbitrary restriction for which I cannot see any justification.

flaviut · 2015-01-04T02:58:14Z

toml.abnf

+time-second    = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second rules
+time-secfrac   = "." 1*DIGIT
+time-numoffset = ( "+" / "-" ) time-hour ":" time-minute
+time-offset    = "Z" / time-numoffset


It should be clarified that Z can be uppercase or lowercase:

NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively.

Same with <exp> and <date-time>.
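As a rough illustration of the offset rule quoted above, the following Python sketch accepts either case of "Z". The name `TIME_OFFSET` is hypothetical, and `time-hour` and `time-minute` are assumed to be two digits each, as in RFC 3339; the diff excerpt does not show those rules.

```python
import re

# time-offset = "Z" / time-numoffset, with "Z" read case-insensitively.
# time-hour and time-minute are assumed to be 2DIGIT, as in RFC 3339.
TIME_OFFSET = re.compile(r"(?:[Zz]|[+-][0-9]{2}:[0-9]{2})")

for offset in ["Z", "z", "+02:00", "-07:30", "2:00", "UTC"]:
    print(offset, bool(TIME_OFFSET.fullmatch(offset)))
```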

BurntSushi · 2015-01-04T03:09:46Z

> Is there a reason that 01 is invalid? It seems like an arbitrary restriction for which I cannot see any justification.

Is there a reason why it should be allowed? In many places, prefixing a number with a 0 is an instruction that the number should be interpreted as octal. TOML doesn't support octal, so it seems surprising to allow it.

Moreover, if TOML were ever to adopt octal numerals, this restriction would allow that to be a backwards compatible addition.
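For readers following along, the restriction under discussion can be sketched in Python. The integer rule itself is not quoted in this thread, so the pattern below is an assumed transliteration (optional sign, no leading zeros, single underscores between digits), and `is_toml_integer` is a hypothetical helper name.

```python
import re

# Assumed transliteration of the integer rule: optional sign, then either a
# lone 0 or a nonzero digit followed by digits with single underscores between.
INTEGER = re.compile(r"[+-]?(?:0|[1-9](?:_?[0-9])*)")


def is_toml_integer(text: str) -> bool:
    """Return True if text is an integer under the assumed rule."""
    return INTEGER.fullmatch(text) is not None


for candidate in ["0", "42", "-17", "1_000", "01", "0755"]:
    print(candidate, is_toml_integer(candidate))
```

Under this reading, `01` and `0755` are rejected outright, which is what leaves room for an octal syntax to be added later without changing the meaning of any existing document.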

dhardy · 2015-01-04T11:57:45Z

@BurntSushi out with the old 0755 octal notation. If you need octal, use something less ambiguous like 0c755.

flaviut · 2015-01-04T12:51:13Z

I agree with @dhardy. 0777 is terrible for beginners and confusing for everyone else.

BurntSushi · 2015-01-04T14:10:43Z

0777 is the prevailing convention.

BurntSushi · 2015-01-04T14:12:18Z

(And that's a good enough reason to bar integers from starting with a 0 IMO. And my backwards compatible argument is still viable, because it gives us the flexibility to choose the syntax.)

flaviut · 2015-01-04T17:57:57Z

toml.abnf

+std-table-close = ws %x5D     ; ] Right square bracket
+table-key-sep   = ws %x2E ws  ; . Period
+
+std-table = std-table-open key *( table-key-sep key) std-table-close


This rule will match [], which is expressly forbidden in the spec.

Using <key> in this way does not follow the spec, which makes no mention of quoted table names.

@flaviut I don't see how it will match []. The <key> rule mandates that at least 1 character be present. Also, see #283 which clarifies key names to match the rules present here.

My bad then, sorry. It seems like I read over the first <key> without noticing it.
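The point that the rule cannot match `[]` can be checked with a small Python sketch. It is a simplification under stated assumptions: `KEY` covers unquoted keys only, `std-table-open` is assumed to be a left bracket followed by optional whitespace, and the names `WS`, `KEY`, and `STD_TABLE` are invented for this illustration.

```python
import re

WS = r"[ \t]*"           # ws = *( space / tab )
KEY = r"[A-Za-z0-9_-]+"  # simplified: unquoted keys only, at least one character

# std-table = std-table-open key *( table-key-sep key ) std-table-close
STD_TABLE = re.compile(
    r"\[" + WS + KEY + r"(?:" + WS + r"\." + WS + KEY + r")*" + WS + r"\]"
)

for header in ["[]", "[ ]", "[a]", "[a.b]", "[ a . b ]", "[a.]"]:
    print(repr(header), bool(STD_TABLE.fullmatch(header)))
```

Because the first key is mandatory and must contain at least one character, an empty pair of brackets never matches.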

mojombo · 2015-01-16T17:57:39Z

A few updates on this:

I've spent some time writing a PEG definition of TOML to see if that would be better to use than ABNF. I like the ideas behind PEGs a lot, but I don't think it's a suitable grammar format. There does not seem to be any official RFC-like standard for PEGs and the Bryan Ford paper is woefully lacking in details like how to specify ranges of Unicode characters, specific numbers of character repetition, etc. I haven't seen a single real-world example of it being used as an official definition, anywhere, ever. So I'm going to stick with ABNF.
I'd like to merge the various parts of the ABNF into the full spec soon, so if anyone has additional changes to propose before that happens, now is the time!

alexcrichton · 2015-01-16T18:13:59Z

Currently this adds syntax for inline tables, so I just wanted to confirm, would you want to add that now or merge that part of the ABNF at a possible future date?

This is also super helpful @mojombo, thanks for doing this!

mojombo · 2015-01-16T18:27:04Z

@alexcrichton I added that just to see how complex it would be. I'd merge that into the spec when Inline Tables go in.

avakar · 2015-06-21T15:57:37Z

toml.abnf

+array = array-open array-values array-close
+
+array-values = [ val [ array-sep ] [ ( comment newlines) / newlines ] /
+                 val array-sep [ ( comment newlines) / newlines ] array-values ]


This disallows a newline before the first element, e.g.

# this is not allowed
key = [
  1]

but allows it after:

# this is ok
key = [1
]

Is that intended?

Hopefully not. Could this be fixed? (It's surprising behavior)

It also disallows newlines between values and commas. Is it ok?

@mojombo Ping...

Any news on this?

RFC 4234 Section 3.5 advises the use of grouping notation rather than "bare" alternation when alternatives consist of multiple rules or literals, e.g. "( int float ) / ( bool char )" instead of "int float / bool char". I might've gotten the grouping wrong, which just serves to illustrate the importance of using grouping notation.

The README.md has this at the end of the first TOML example:

```toml
hosts = [
  "alpha",
  "omega"
]
```

And later says:

> ...arrays also ignore newlines....

But the ABNF doesn't allow for a newline to precede the first element. I wasn't sure if the grammar should allow more than one newline preceding the first element, but given that an arbitrary number of whitespaces are allowed and the readme says that newlines are ignored in general I figured allowing an arbitrary number of newlines best matches your intent. It also won't break anyone that happens to have multiple leading newlines in their arrays.

The array-values production already allows for an arbitrary amount of newlines before the closing brace. It doesn't, however, allow any whitespace after the newlines that don't come just before the closing brace so: ```toml [5, 6 \t\t] ``` is valid, but ```toml [5, 6 \t\t ] ``` is invalid. Probably not a big deal.

Back to the README example:

```toml
hosts = [
  "alpha",
  "omega"
]
```

The ABNF allows spaces after an element but before any newlines. The examples in the README, however, have spaces before elements after a newline, which isn't allowed in the grammar. I've added a new production "ws-newlines", which is just like "ws-newline" except that it requires at least one newline, like the production it replaces ("newlines"). With "ws-newlines" it is legal to put whitespace before an element after a newline.

Going over your hard example I found I couldn't parse this:

```toml
multi_line_array = [
	"]",
	# ] Oh yes I did
	]
```

It turns out that the grammar doesn't allow newlines and whitespaces after the array-sep. I've modified the production again to accommodate newlines and whitespace after the array separator.
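The intended array behaviour described in these comments can be approximated with a short Python recognizer. This is only a sketch and not the ABNF from the PR: values are limited to integers and double-quoted strings without escapes, and the names `FILLER`, `VALUE`, `ARRAY`, and `is_array` are hypothetical.

```python
import re

# Filler between tokens: spaces, tabs, newlines, or a comment running to end of line.
FILLER = r"(?:[ \t\r\n]|#[^\n]*\n)*"
# Simplified values: integers or double-quoted strings without escapes.
VALUE = r'(?:"[^"\n]*"|[+-]?[0-9]+)'

# Filler is allowed after "[", around each separator, and before "]";
# a single trailing separator after the last value is also accepted.
ARRAY = re.compile(
    r"\[" + FILLER
    + r"(?:" + VALUE
    + r"(?:" + FILLER + r"," + FILLER + VALUE + r")*"
    + FILLER + r"(?:," + FILLER + r")?"
    + r")?\]"
)


def is_array(text: str) -> bool:
    """Return True if text is an array under the simplified rules above."""
    return ARRAY.fullmatch(text) is not None


print(is_array("[\n  1]"))                                 # newline before first element
print(is_array("[1\n]"))                                   # newline after last element
print(is_array('[\n\t"]",\n\t# ] Oh yes I did\n\t]'))      # the hard example
print(is_array("[1 2]"))                                   # missing separator
```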

TerjeBr · 2016-01-08T13:41:00Z

toml.abnf

+
+;; Built-in ABNF terms, reproduced here for clarity
+
+; ALPHA = %x41-5A / %x61-7A ; A-Z / a-z


Why not use Unicode ALPHA? That would allow characters like 'Æ', 'ø', 'å', 'ß' in keys, and would allow, for example, Arabic or Hebrew to be used in keys.


dead-claudia · 2016-05-11T10:50:06Z

@mojombo Ping?

pradyunsg · 2016-12-23T08:31:18Z

@mojombo Ping 2.0?

Added grouping to ambiguous alternatives, allow leading newlines inside arrays and spaces after newlines before array elements and newlines and spaces before comments in arrays

mojombo · 2017-01-04T23:02:33Z

I'm pretty happy with the ABNF now, and I think it'll be super helpful for implementers. Time to merge! From now on, we should update the ABNF any time the TOML spec changes.

Add ABNF description of TOML.

fcd0229

mojombo mentioned this pull request Jul 18, 2014

Add EBNF #199

Closed

mojombo added 2 commits July 17, 2014 19:49

Properly specify unquoted-key.

cb50c72

Fix whitespace in array-table-close.

7b439f6

mojombo added 2 commits November 10, 2014 16:59

Update ABNF with v0.3.0 compliant integer/float rules.

bd4fe79

Update ABNF with RFC 3339 datetime spec to be v0.3.0 compliant.

4d15ea5

mojombo mentioned this pull request Nov 11, 2014

Grammar is ambiguous or confusing. #262

Closed

Update for newline clarifications.

52dc64a

flaviut reviewed Jan 4, 2015

Add \UXXXXXXXX escape sequence.

3808902

mojombo mentioned this pull request Jan 16, 2015

Clarification about newlines in multi-line strings and empty keys in quotes #286

Closed

Allow underscores in int/float ABNF.

0cee090

avakar reviewed Jun 21, 2015

avakar mentioned this pull request Jun 21, 2015

Subsume comments in newline #333

Closed

ghost mentioned this pull request Jun 22, 2015

[RFC] TOML v0.4.x grammar in ABNF #334

Closed

avakar mentioned this pull request Jun 24, 2015

EBNF Specification #336

Closed

andrusha97 added a commit to andrusha97/loltoml that referenced this pull request Oct 17, 2015

Make the parser compliant with formal grammar from toml-lang/toml#236

8a3847b

joelself added 5 commits December 31, 2015 13:04

TerjeBr reviewed Jan 8, 2016

alexcrichton mentioned this pull request Jan 19, 2016

RFC: Improve Cargo target-specific dependencies rust-lang/rfcs#1361

Merged

Adjusted the abnf to allow for whitespaces after the last element without a leading newline.

75f6ba3

benley mentioned this pull request Mar 31, 2016

add manifestToml(v) function google/jsonnet#149

Closed

alexcrichton mentioned this pull request Apr 18, 2016

Require newline after table toml-rs/toml-rs#94

Merged

alexcrichton mentioned this pull request May 22, 2016

Accepts invalid datetimes toml-rs/toml-rs#99

Closed

mojombo added 8 commits January 3, 2017 16:12

Merge pull request #378 from joelself/abnf

9882e88

Added grouping to ambiguous alternatives, allow leading newlines inside arrays and spaces after newlines before array elements and newlines and spaces before comments in arrays

Add ABNF work-in-progress warning.

92f20e5

Fix array grouping.

eb703d2

Add ABNF for Date and Time.

a2c74ce

RFC 5234 is latest for ABNF.

7c0db2c

Make ABNF compatible with a real ABNF parser.

f9d4429

Reorder ABNF rules for better clarity.

514037d

Delete unused rule.

ecb8274

mojombo merged commit 0fbaefd into master Jan 4, 2017

mojombo deleted the abnf branch February 5, 2018 00:33

workingjubilee mentioned this pull request Sep 3, 2019

Change ABNF array definition to permit single value types only #663

Closed
