-
Notifications
You must be signed in to change notification settings - Fork 830
[Parser] Lex floating point values #4693
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Rather than trying to actually implement the parsing of float values, which cannot be done naively due to precision concerns, just parse the float grammar then postprocess the parsed text into a form we can pass to `strtod` to do the actual parsing of the value. Since the float grammar reuses `num` and `hexnum` from the integer grammar but does not care about overflow, add a mode to `LexIntCtx`, `num`, and `hexnum` to allow parsing overflowing numbers. For NaNs, store the payload as a separate value rather than as part of the parsed double. The payload will be injected into the NaN at a higher level of the parser once we know whether we are parsing an f64 or an f32 and therefore know what the allowable payload values are.
|
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
|
|
||
| std::optional<int> getHexDigit(char c) { | ||
| if ('0' <= c && c <= '9') { | ||
| return {c - '0'}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't these be without the { }? iirc C++ will convert X to optional<X> for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove these and other unnecessary braces in a separate NFC PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
| } | ||
| std::string str = ss.str(); | ||
| char* last; | ||
| double d = std::strtod(str.data(), &last); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code uses strtof for an f32, but I'm not sure if that's necessary...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no way to tell whether or not we are lexing an f32 or an f64 without higher level parser context, so we have to conservatively use double precision here. I don't think there are any problems that can arise from that.
| Lexer lexer("NaN"sv); | ||
| EXPECT_EQ(lexer, lexer.end()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say the spec tests may be good enough coverage for corner cases. But if you prefer to write out unit tests like this I'm not opposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're a while off from being able to run spec tests with this new parser and the unit tests have been helpful for catching bugs.
| { | ||
| Lexer lexer("nan:0x01"sv); | ||
| ASSERT_NE(lexer, lexer.end()); | ||
| Token expected{"nan:0x01"sv, FloatTok{{1}, NAN}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not need these NaN payloads for some spec tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The token includes the NaN payload, just not as part of the NaN. The payload will be injected into the double value (or float value) at a higher level once we know whether we need an f32 or f64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, sounds good.
|
Graphite Merge Job Current status: ✅ Merged This pull request was successfully merged as part of a stack. This comment was auto-generated by Graphite. Job Reference: A9XO0urDPXHWLrhQjA6D |
aheejin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess most of the questions stem from my lack of knowledge about floating point representation, but anyway..
| } | ||
| if (nanPayload) { | ||
| double nan = basic->span[0] == '-' ? -NAN : NAN; | ||
| return LexFloatResult{*basic, nanPayload, nan}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this initialize base class members in order and then the child class members? Didn't know this was possible..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's my understanding. I wouldn't have thought that this would work either, but apparently it does 🤷
| } | ||
| // strtod does not return -NAN for "-nan" on all platforms. | ||
| if (basic->span == "-nan"sv) { | ||
| return LexFloatResult{*basic, nanPayload, -NAN}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return LexFloatResult{*basic, nanPayload, -NAN}; | |
| return LexFloatResult{*basic, {}, -NAN}; |
Can nanPayload ever be non-null here, given that the if above does an early return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it will always be nullopt. I can make this change in a follow-up.
| // The payload if we lexed a nan with payload. We cannot store the payload | ||
| // directly in `d` because we do not know at this point whether we are parsing | ||
| // an f32 or f64 and therefore we do not know what the allowable payloads are. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For non-NaN numbers, you store as a double conservatively and fix it later. Can't we do the same for NaNs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but the problem is that the default payload depends on the size of the float we are parsing. It's (2^52)-1 for f64 and (2^23)-1 for f32. If we stored the payload directly in the double to be fixed up later, then we wouldn't be able to tell the difference between a custom payload of (2^52)-1 that is out-of-bounds for an f32 and a default payload that was conservatively set to (2^53)-1 and can be fixed up. We could fix that in different ways, like having a bool isDefaultPayload, but this seems simpler overall.
| { | ||
| Lexer lexer("nan"sv); | ||
| ASSERT_NE(lexer, lexer.end()); | ||
| Token expected{"nan"sv, FloatTok{{}, NAN}}; | ||
| EXPECT_EQ(*lexer, expected); | ||
| } | ||
| { | ||
| Lexer lexer("+nan"sv); | ||
| ASSERT_NE(lexer, lexer.end()); | ||
| Token expected{"+nan"sv, FloatTok{{}, NAN}}; | ||
| EXPECT_EQ(*lexer, expected); | ||
| } | ||
| { | ||
| Lexer lexer("-nan"sv); | ||
| ASSERT_NE(lexer, lexer.end()); | ||
| Token expected{"-nan"sv, FloatTok{{}, -NAN}}; | ||
| EXPECT_EQ(*lexer, expected); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not need payload for nan? The spec says 2^(signif(N)-1)
'nan' => nan(2^(signif(N)-1))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation. We will interpret a null payload to mean the default payload when d is NaN and inject that default payload in the same part of the parser (not implemented yet) where we will inject non-default payloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Might be helpful for reading to add a comment after this block:
binaryen/src/wasm/wat-parser-internal.h
Lines 602 to 615 in 590af86
| if (ctx.takePrefix(":0x"sv)) { | |
| if (auto lexed = hexnum(ctx.next())) { | |
| ctx.take(*lexed); | |
| if (1 <= lexed->n && lexed->n < (1ull << 52)) { | |
| ctx.nanPayload = lexed->n; | |
| } else { | |
| // TODO: Add error production for invalid NaN payload. | |
| return {}; | |
| } | |
| } else { | |
| // TODO: Add error production for malformed NaN payload. | |
| return {}; | |
| } | |
| } |
| { | ||
| Lexer lexer("nan:0x0"sv); | ||
| ASSERT_NE(lexer, lexer.end()); | ||
| Token expected{"nan:0x0"sv, KeywordTok{}}; | ||
| EXPECT_EQ(*lexer, expected); | ||
| } | ||
| { | ||
| Lexer lexer("nan:0x10_0000_0000_0000"sv); | ||
| ASSERT_NE(lexer, lexer.end()); | ||
| Token expected{"nan:0x10_0000_0000_0000"sv, KeywordTok{}}; | ||
| EXPECT_EQ(*lexer, expected); | ||
| } | ||
| { | ||
| Lexer lexer("nan:0x1_0000_0000_0000_0000"sv); | ||
| ASSERT_NE(lexer, lexer.end()); | ||
| Token expected{"nan:0x1_0000_0000_0000_0000"sv, KeywordTok{}}; | ||
| EXPECT_EQ(*lexer, expected); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these 'nan:0x'n:hexnum pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They would be, but n is outside the allowable bounds for f64 NaN payloads, so these fail to parse as floats. The lexer then tries to lex them as keywords, which succeeds. In the future once we have better error handling here, we should produce an error about the payload being out of bounds instead of falling back to parsing it as a keyword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't even allow 0x0... 0 is also outside the allowable bounds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. Apparently if all the payload bits are zero, then the value is +-infinity rather than a NaN.
| { | ||
| Lexer lexer("Inf"sv); | ||
| EXPECT_EQ(lexer, lexer.end()); | ||
| } | ||
| { | ||
| Lexer lexer("INF"sv); | ||
| EXPECT_EQ(lexer, lexer.end()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't recognize these as keywords? The same for NaN/NAN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, keywords can only start with lowercase letters 😆

Rather than trying to actually implement the parsing of float values, which
cannot be done naively due to precision concerns, just parse the float grammar
then postprocess the parsed text into a form we can pass to
strtodto do theactual parsing of the value.
Since the float grammar reuses
numandhexnumfrom the integer grammar butdoes not care about overflow, add a mode to
LexIntCtx,num, andhexnumtoallow parsing overflowing numbers.
For NaNs, store the payload as a separate value rather than as part of the
parsed double. The payload will be injected into the NaN at a higher level of
the parser once we know whether we are parsing an f64 or an f32 and therefore
know what the allowable payload values are.