Disallow integers in keys #592

xoofx · 2019-02-02T07:34:39Z

Hey,

While implementing the lexer for TOML, I have found that allowing integers in a key makes the lexer parser state dependent. It is pretty uncommon to have this in simple languages as it happens usually in languages that have complicated context dependent parsing (e.g preprocessor macros in a C like language). Imo, it's an unfortunate choice for a simple language like TOML to incur contextual lexers.

Another weird side effect of this is that you can have in a key something that looks like a float, or an integer hexa

[table]
1.0 = 2.0
0x155 = 0x155

I don't know if this is something that could be revisited before 1.0 and change a key to [a-zA-Z\-_][a-zA-Z\-_0-9]*?

The text was updated successfully, but these errors were encountered:

pradyunsg · 2019-02-02T09:51:32Z

ASCII letters, ASCII digits, underscores, and dashes (A-Za-z0-9_-)

Base keys can not contain .. The first key is thus not valid TOML.

As for the alternative notations, I don't think it's really that big an issue.

xoofx · 2019-02-02T10:00:55Z

Base keys can not contain .. The first key is thus not valid TOML.

Sorry, I used the term key as a bag for base key and dot key. So 1.0 is a valid dot key right?

As for the alternative notations, I don't think it's really that big an issue.

hm, I don't understand which alternatives you are referring to? 😅

I'm just pointing a double issue:

the base key/dot key allow to parse things that look like integers, floating points while of course the specs says that it would be converted to a string, but imo this is confusing
it makes the lexer state dependent, which is not a great choice for a language that is supposed to be simple

bd82 · 2019-02-02T11:02:05Z

Interesting.

I have recently implemented a Toml lexer & parser but did not notice this edge case.

Perhaps it can be resolved by lexer states that are independent of the parser state?
Example:

Start in KEY mode (no floating point allowed)
Once an "=" token is encountered we are in VALUE mode and floating points are allowed.
Finding a newline in VALUE mode would move us back to KEY mode.

This obviously is naive and simplified (no handling of inline-tables) but it is only meant to demonstrate the concept.

btw if we can differentiate between TOP Level context (key) and value mode
that will also resolve the "[[" vs "[" ambiguity (table vs table of tables vs array of arrays)

xoofx · 2019-02-02T11:58:12Z

Perhaps it can be resolved by lexer states that are independent of the parser state?

That's a bit playing with the concepts! 😅 As it's basically bringing parser knowledge into the lexer... and also as you said, inline tables are going to require more knowledge to correctly do it.

In my case, my lexer has something like this:

if (State == LexerSate.Key)
{
    NextTokenForKey();
}
else
{
    NextTokenForValue();
}

and the parser is feeding the lexer with the correct state, and I'm "fine" with this solution. But this is annoying, because typically, I cannot test my lexer separately on a full TOML document and it is even barely unit testable on slices, as it would require my unit tests to have this parser knowledge...

But I'm worried that my original issue dragged the discussion to some "implementation details" while I should have put an emphasis on the user experience ambiguity first (e.g 1.0 = 2.0), implying the side effect of an unfortunate parser/lexer state coupling

bd82 · 2019-02-02T12:14:23Z

Back to the high level concepts.

I agree that from a simplicity perspective that it would be best to decouple the lexing and parsing state completely.

If only due to the fact some language tools have completely separate lexing and parsing
phases so more tools could be used to build (correct) parsers for Toml without "advanced mitigation logic"

Some References to a similar (but more complex) scenario:

ECMAScript is difficult to lex without parser knowledge.

There is another concern that when creating a fault tolerant parser,
for example to be used in an IDE.
It is much easier to naively recover from lexing errors by simply ignoring unidentified
character sequences when the lexer has only a single mode.

xoofx · 2019-02-02T12:16:38Z

It is much easier to naively recover from lexing errors by simply ignoring unidentified
character sequences when the lexer has only a single mode.

Yes, and specially, in the case of TOML, we are lucky that the grammar doesn't require token previews

Jezza · 2019-02-02T23:10:07Z

So, I just thought I'd add my two cents, having already taken a stab at implementing a toml 0.5 parser.

Parser Code

Basically, if it's lexed as number, it needs to be a base 10 integer, otherwise it'll reject it.

I've actually just taken a glance at the spec, and I realise I've actually done some things incorrectly.
First of all, I should definitely accept 0x2 as a valid key.
It's not a number, but still a valid key.

However, floats an issue does come up.
JFlex does nom as much input as it can, and so it'll take 1.0 as a float, over 1 . 0.
I can perform a work around in code, and honestly, it's quite easy, but yeah, just thought
I'd chime in and say maybe the spec should inform us a bit about the precedence.

Reading the actual spec, I would 100% convert 1.0 into a dotted key with 1 and 0.
I think I'll go change my parser to do that.

Jezza · 2019-02-02T23:36:19Z

Ok, just went and changed it:

10 = "asd"
1.0 = "asd"
1.00.0 = "asd"
0x10 = "asd"
0o10 = "asd"
0b10 = "asd"

[1.00.00]
value = "asd"

is now parsed as:

{
    10 = "asd",
    1 = {
        0 = "asd",
        00 = {
            0 = "asd",
            00 = {
                value = "asd",
            },
        },
    },
    0x10 = "asd",
    0o10 = "asd",
    0b10 = "asd",
}

xoofx · 2019-02-03T09:52:26Z

JFlex does nom as much input as it can, and so it'll take 1.0 as a float, over 1 . 0.
I can perform a work around in code, and honestly, it's quite easy, but yeah, just thought
I'd chime in and say maybe the spec should inform us a bit about the precedence.

And that's another case to remove these integers in keys in TOML, as I'm pretty sure that many implementations got actually that wrong.

That's why you can't use a bare lexer, you need to maintain a state, otherwise going through "oh I just parsed a float but it is actually not " is pretty fragile/clunky to recover from (specially with the dot, as the parser is usually in a very different state when it handles a dotted key)

pradyunsg · 2019-02-03T12:20:30Z

So 1.0 is a valid dot key right?

Yes.

This is not going to change in TOML 1.0. We could revisit this post TOML 1.0, if someone's motivated enough to bring this up then. :)

~~Closing since there's nothing actionable here.~~

Jezza · 2019-02-03T14:10:07Z

Personally, I think that integer keys are a bit weird anyway, but I might just not have the use-case for it.

The best action might be to clarify that 1.0 = should be parsed as a dotted key 1 0.

It's implicit in the spec, but as this issue is showing, a couple of people already missed it, and it's not even version 1.0.

pradyunsg · 2019-02-03T14:12:57Z

The best action might be to clarify that 1.0 = should be parsed as a dotted key 1 0.

Happy to accept a PR for this. :)

xoofx · 2019-02-03T14:16:25Z

This is not going to change in TOML 1.0. We could revisit this post TOML 1.0, if someone's motivated enough to bring this up then. :)

Is this because you don't want to break compatibility (while I thought that 0.x could allow this) or you think that the cost of the ambiguity is worth the price that allowing integers in keys is bringing? I could run some stats on all TOML documents on github to find out how many are using integers in their keys if that's something that could balance a decision....
Otherwise, I don't think TOML should revisit later, it is either today or never.

It's implicit in the spec, but as this issue is showing, a couple of people already missed it, and it's not even version 1.0.

Yes, and I'm going to send a PR to https://github.com/BurntSushi/toml-test to add these edge cases... pretty sure that it is going to break a few libraries that were assuming full 0.5 compatibility.

pradyunsg · 2019-02-03T14:17:58Z

Is this because you don't want to break compatibility (while I thought that 0.x could allow this)

This -- the README has a note about 0.5.0 being "extremely stable".

xoofx · 2019-02-03T14:22:25Z

This -- the README has a note about 0.5.0 being "extremely stable".

You are going finicky. 😛 My turn: you have omitted to mention the following phrase:

The goal is for version 1.0.0 to be backwards compatible (as much as humanly possible) with version 0.5.0

So the README doesn't say that it should stay stable at all cost 😉

But fair enough, I will find the time to send a PR to make the specs more explicit.

pradyunsg · 2019-02-03T14:30:16Z

As of version 0.5.0, TOML should be considered extremely stable. The goal is for version 1.0.0 to be backwards compatible (as much as humanly possible) with version 0.5.0. All implementations are strongly encouraged to become 0.5.0 compatible so that the transition to 1.0.0 will be simple when that happens.

@xoofx I'm not sure what you mean by finicky here. I assume you meant it in good faith. :)

xoofx · 2019-02-03T14:39:29Z

@xoofx I'm not sure what you mean by finicky here. I assume you meant it in good faith. :)

I was just teasing a bit, as I found the responses a bit terse regarding the problem I'm bringing here 😊

philippgl · 2019-03-07T19:44:40Z

Reading the spec, if would think, that integers (!= string with digits) are not allowed as keys and

[table]
1.0 = 2.0
0x155 = 0x155

is parsed as

[table]
"1"."0" = 2.0
"0x155" = 0x155

and

1.0 = 2.0
"1"."0" = "whatever"

is forbidden due to the duplicate key. (github output looks strange)

xoofx · 2019-03-08T09:08:13Z

Reading the spec, if would think, that integers (!= string with digits) are not allowed as keys and
... is forbidden due to the duplicate key. (github output looks strange)

Indeed. The point of the discussion here was just to highlight that "integers like" keys can be actually confusing and it makes the lexer dependent on the state of the parser, which is quite annoying.

LongTengDao · 2019-03-09T23:20:32Z

(github output looks strange)

maybe github only support toml earlier version which not support dotted keys...

pradyunsg · 2019-05-06T10:21:02Z

(slight OT) @xoofx I usually try not to be terse or come across as dismissive but apologies if it came across as such. Sarcasm isn't something that gets communicated well unless folks share the same context which I didn't in this case.

The ship has sailed for this. For now, it's better to not break compatibility with TOML 0.5 -> 1.0.

In terms of actionable items here, I am splitting this ticket into 2.

Add a small note to the dotted keys section that 1.0 = true gets parsed into {"1": {"0": true"}}.
Have a post-1.0 discussion for actually disallowing this in the next major version bump.

sevmeyer · 2019-05-07T18:35:58Z

Please excuse my ignorance, but doesn't this also affect keywords? Even with the proposed rule [a-zA-Z\-_][a-zA-Z\-_0-9]*, bare keys remain context-sensitive:

true = false
nan  = inf
-42  = +42

I don't think there is a simple way to avoid this ambiguity.

eksortso · 2019-05-07T23:27:36Z

Devil's advocate says: It's not hard to tell what's on the left side of the equals sign and what's on the right side. Yes you can confuse yourself by naming your keys the same as Boolean values. But there's no good reason to do that.

ChristianSi · 2019-05-08T11:54:43Z

Regarding the post-1.0 discussion, I'd argue that integer-like bare keys should not be forbidden since they might actually be useful in some cases. Maybe someone has a file birthdays.toml:

[December]
3 = "myself"
9 = ["Kirk Douglas", "John Malkovich"]
24 = "Jesus"

That's just an example, but use cases like this should not be ruled out merely to make the lives of parser writers easier. Nor do I see a good reason to force people to use quotation marks in such cases.

pradyunsg · 2019-05-08T14:25:36Z

I was open to ideas/reasons to change the behavior. @ChristianSi's comment from the future (see end of post) has convinced me that it's not worth changing this behavior. There are valid use cases for this and we explicitly call this behavior allowed in the current specification.

We won't be disallowing integers in keys in TOML. A clarification will be added to explain the dotted keys edge case (this is being tracked #616).

Thanks for filing this issue @xoofx and to everyone who's contributed to this discussion! :)

Hi @ChristianSi from the future! :P

ChristianSi · 2019-05-08T14:58:10Z

Well, post-1.0 is still a long way of, so the future seemed the right place to make that comment 😁

Also, this is the first time someone has reacted to a comment of mine before I even made it. Cheers, @pradyunsg !

pradyunsg added the post-1.0 label May 6, 2019

pradyunsg mentioned this issue May 6, 2019

Add an example about dotted numbers (3.14) as a dotted key #616

Closed

pradyunsg closed this as completed May 8, 2019

eksortso mentioned this issue Jan 4, 2020

Dotted syntax for arrays #693

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disallow integers in keys #592

Disallow integers in keys #592

xoofx commented Feb 2, 2019 •

edited

Loading

pradyunsg commented Feb 2, 2019

xoofx commented Feb 2, 2019

bd82 commented Feb 2, 2019 •

edited

Loading

xoofx commented Feb 2, 2019 •

edited

Loading

bd82 commented Feb 2, 2019

xoofx commented Feb 2, 2019

Jezza commented Feb 2, 2019 •

edited

Loading

Jezza commented Feb 2, 2019 •

edited

Loading

xoofx commented Feb 3, 2019 •

edited

Loading

pradyunsg commented Feb 3, 2019 •

edited

Loading

Jezza commented Feb 3, 2019

pradyunsg commented Feb 3, 2019

xoofx commented Feb 3, 2019

pradyunsg commented Feb 3, 2019 •

edited

Loading

xoofx commented Feb 3, 2019

pradyunsg commented Feb 3, 2019

xoofx commented Feb 3, 2019

philippgl commented Mar 7, 2019 •

edited

Loading

xoofx commented Mar 8, 2019

LongTengDao commented Mar 9, 2019

pradyunsg commented May 6, 2019 •

edited

Loading

sevmeyer commented May 7, 2019

eksortso commented May 7, 2019

ChristianSi commented May 8, 2019 •

edited

Loading

pradyunsg commented May 8, 2019

ChristianSi commented May 8, 2019 •

edited

Loading

Disallow integers in keys #592

Disallow integers in keys #592

Comments

xoofx commented Feb 2, 2019 • edited Loading

pradyunsg commented Feb 2, 2019

xoofx commented Feb 2, 2019

bd82 commented Feb 2, 2019 • edited Loading

xoofx commented Feb 2, 2019 • edited Loading

bd82 commented Feb 2, 2019

Back to the high level concepts.

xoofx commented Feb 2, 2019

Jezza commented Feb 2, 2019 • edited Loading

Jezza commented Feb 2, 2019 • edited Loading

xoofx commented Feb 3, 2019 • edited Loading

pradyunsg commented Feb 3, 2019 • edited Loading

Jezza commented Feb 3, 2019

pradyunsg commented Feb 3, 2019

xoofx commented Feb 3, 2019

pradyunsg commented Feb 3, 2019 • edited Loading

xoofx commented Feb 3, 2019

pradyunsg commented Feb 3, 2019

xoofx commented Feb 3, 2019

philippgl commented Mar 7, 2019 • edited Loading

xoofx commented Mar 8, 2019

LongTengDao commented Mar 9, 2019

pradyunsg commented May 6, 2019 • edited Loading

sevmeyer commented May 7, 2019

eksortso commented May 7, 2019

ChristianSi commented May 8, 2019 • edited Loading

pradyunsg commented May 8, 2019

ChristianSi commented May 8, 2019 • edited Loading

xoofx commented Feb 2, 2019 •

edited

Loading

bd82 commented Feb 2, 2019 •

edited

Loading

xoofx commented Feb 2, 2019 •

edited

Loading

Jezza commented Feb 2, 2019 •

edited

Loading

Jezza commented Feb 2, 2019 •

edited

Loading

xoofx commented Feb 3, 2019 •

edited

Loading

pradyunsg commented Feb 3, 2019 •

edited

Loading

pradyunsg commented Feb 3, 2019 •

edited

Loading

philippgl commented Mar 7, 2019 •

edited

Loading

pradyunsg commented May 6, 2019 •

edited

Loading

ChristianSi commented May 8, 2019 •

edited

Loading

ChristianSi commented May 8, 2019 •

edited

Loading