@tlively tlively commented May 26, 2022

Rather than trying to actually implement the parsing of float values, which
cannot be done naively due to precision concerns, just parse the float grammar
then postprocess the parsed text into a form we can pass to strtod to do the
actual parsing of the value.
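
As a rough illustration of the postprocessing step, here is a minimal sketch that assumes the text has already been validated against the float grammar. The helper name `parseValidatedFloat` is hypothetical and not taken from the PR. It only strips the `_` digit separators that the text format allows but `strtod` does not understand, and the real code may do more.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from the PR): convert float text that has already
// passed the grammar check into a double by handing a cleaned copy to strtod.
std::optional<double> parseValidatedFloat(std::string_view text) {
  std::string cleaned;
  cleaned.reserve(text.size());
  for (char c : text) {
    if (c != '_') { // '_' separators are not understood by strtod
      cleaned.push_back(c);
    }
  }
  if (cleaned.empty()) {
    return std::nullopt;
  }
  char* last = nullptr;
  double d = std::strtod(cleaned.c_str(), &last);
  // Reject the input unless strtod consumed every character.
  if (last != cleaned.c_str() + cleaned.size()) {
    return std::nullopt;
  }
  return d;
}
```

Letting `strtod` do the numeric conversion avoids hand-written rounding logic, and the lexer still decides what counts as a valid token.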

Since the float grammar reuses num and hexnum from the integer grammar but
does not care about overflow, add a mode to LexIntCtx, num, and hexnum to
allow parsing overflowing numbers.
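
The overflow mode could look roughly like the following sketch. The names `OverflowBehavior`, `LexedNum`, and `lexNum` are illustrative assumptions rather than the PR's actual declarations. In the strict mode an overflowing decimal number is rejected, while in the permissive mode the digits are still consumed and the accumulated value is simply not meaningful.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical names, for illustration only.
enum class OverflowBehavior { Disallow, Ignore };

struct LexedNum {
  std::size_t length; // characters consumed, including '_' separators
  std::uint64_t n;    // value; not meaningful if overflow was ignored
};

// Lex digit ('_'? digit)* from the start of `in`.
std::optional<LexedNum> lexNum(std::string_view in, OverflowBehavior overflow) {
  constexpr std::uint64_t maxValue = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t n = 0;
  std::size_t i = 0;
  while (i < in.size()) {
    // After the first digit, a single '_' may precede each further digit.
    std::size_t pos = (i > 0 && in[i] == '_') ? i + 1 : i;
    if (pos >= in.size() || in[pos] < '0' || in[pos] > '9') {
      break;
    }
    std::uint64_t digit = in[pos] - '0';
    if (overflow == OverflowBehavior::Disallow && n > (maxValue - digit) / 10) {
      return std::nullopt; // would exceed 64 bits
    }
    n = n * 10 + digit; // wraps modulo 2^64 when overflow is ignored
    i = pos + 1;
  }
  if (i == 0) {
    return std::nullopt; // no digits at all
  }
  return LexedNum{i, n};
}
```

A float lexer built on this would call `lexNum` in the `Ignore` mode, since it only needs to know how many characters belong to the number.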

For NaNs, store the payload as a separate value rather than as part of the
parsed double. The payload will be injected into the NaN at a higher level of
the parser once we know whether we are parsing an f64 or an f32 and therefore
know what the allowable payload values are.

@tlively tlively requested review from aheejin and kripken May 26, 2022 21:51

std::optional<int> getHexDigit(char c) {
  if ('0' <= c && c <= '9') {
    return {c - '0'};
Member

Can't these be without the { }? iirc C++ will convert X to optional<X> for you.

Member Author

I'll remove these and other unnecessary braces in a separate NFC PR.

Member

Sounds good.
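
For reference, the brace-free form discussed in this thread might look like the sketch below. Only the first branch appears in the quoted diff. The handling of `a`-`f` and `A`-`F` is an assumption about what the rest of the function does.

```cpp
#include <cassert>
#include <optional>

// Sketch: returning a plain int from a function that returns
// std::optional<int> converts implicitly, so the braces are not needed.
std::optional<int> getHexDigit(char c) {
  if ('0' <= c && c <= '9') {
    return c - '0';
  }
  if ('a' <= c && c <= 'f') { // assumed branch, not shown in the diff
    return 10 + c - 'a';
  }
  if ('A' <= c && c <= 'F') { // assumed branch, not shown in the diff
    return 10 + c - 'A';
  }
  return std::nullopt;
}
```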

}
std::string str = ss.str();
char* last;
double d = std::strtod(str.data(), &last);
Member

The existing code uses strtof for an f32, but I'm not sure if that's necessary...

Member Author

There's no way to tell whether we are lexing an f32 or an f64 without higher-level parser context, so we have to conservatively use double precision here. I don't think there are any problems that can arise from that.
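
A minimal sketch of the two-stage approach described here, using the hypothetical helpers `lexAsDouble` and `narrowToF32`: the lexer parses at double precision, and a later stage that knows the type narrows the value. One hedge worth stating is that rounding twice (text to double, then double to float) can in rare near-halfway cases differ from a single rounding with `strtof`, so this is an illustration rather than a claim about exact equivalence.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical lexer-level step: always parse at double precision.
double lexAsDouble(const char* text) { return std::strtod(text, nullptr); }

// Hypothetical parser-level step: narrow once the type is known to be f32.
// Values outside the float range would need separate handling (not shown).
float narrowToF32(double d) { return static_cast<float>(d); }
```

For common inputs such as `0.1` or `1.5`, narrowing the double gives the same float that `strtof` produces directly.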

    Lexer lexer("NaN"sv);
    EXPECT_EQ(lexer, lexer.end());
  }
}
Member

I'd say the spec tests may be good enough coverage for corner cases. But if you prefer to write out unit tests like this I'm not opposed.

Member Author

We're a while off from being able to run spec tests with this new parser and the unit tests have been helpful for catching bugs.

{
Lexer lexer("nan:0x01"sv);
ASSERT_NE(lexer, lexer.end());
Token expected{"nan:0x01"sv, FloatTok{{1}, NAN}};
Member

Do we not need these NaN payloads for some spec tests?

Member Author

The token includes the NaN payload, just not as part of the NaN. The payload will be injected into the double value (or float value) at a higher level once we know whether we need an f32 or f64.

Member

I see, sounds good.
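
The later injection step is not part of this PR, but it could be sketched as follows. The function names are hypothetical and the sketch assumes IEEE 754 binary32/binary64 layouts. The payload is validated elsewhere, so here it is only asserted.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical: build an f64 NaN from a sign and a payload in [1, 2^52).
double makeF64NaN(bool negative, std::uint64_t payload) {
  assert(payload >= 1 && payload < (1ull << 52));
  std::uint64_t bits =
    (negative ? (1ull << 63) : 0ull) | (0x7ffull << 52) | payload;
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}

// Hypothetical: build an f32 NaN from a sign and a payload in [1, 2^23).
float makeF32NaN(bool negative, std::uint32_t payload) {
  assert(payload >= 1 && payload < (1u << 23));
  std::uint32_t bits = (negative ? (1u << 31) : 0u) | (0xffu << 23) | payload;
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Helpers to inspect the resulting bit patterns.
std::uint64_t bitsOf(double d) {
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return bits;
}

std::uint32_t bitsOf(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}
```

Keeping the payload as a separate integer until this point means the same token can be turned into either width without losing information.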

tlively commented May 27, 2022

Graphite Merge Job

Current status: ✅ Merged

This pull request was successfully merged as part of a stack.

This comment was auto-generated by Graphite.

Job Reference: A9XO0urDPXHWLrhQjA6D

@tlively tlively merged commit 2dbc2b8 into main May 27, 2022
@tlively tlively deleted the lex-floats branch May 27, 2022 00:59
Member

@aheejin aheejin left a comment

I guess most of the questions stem from my lack of knowledge about floating point representation, but anyway..

  }
  if (nanPayload) {
    double nan = basic->span[0] == '-' ? -NAN : NAN;
    return LexFloatResult{*basic, nanPayload, nan};
Member

Does this initialize base class members in order and then the child class members? Didn't know this was possible..

Member Author

Yep, that's my understanding. I wouldn't have thought that this would work either, but apparently it does 🤷
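
This matches C++17 aggregate initialization, where a class with public base classes can still be an aggregate and the initializer list covers the base subobject first, then the members in declaration order. The struct definitions below are assumptions modeled on the names in the diff, not the actual declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Assumed shapes, for illustration only.
struct LexResult {
  std::string_view span;
};

struct LexFloatResult : LexResult {
  std::optional<std::uint64_t> nanPayload;
  double d;
};

// The first initializer fills the LexResult base; the remaining ones fill
// nanPayload and d, in that order.
LexFloatResult makeResult(const LexResult& basic) {
  return LexFloatResult{basic, std::nullopt, 1.5};
}
```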

  }
  // strtod does not return -NAN for "-nan" on all platforms.
  if (basic->span == "-nan"sv) {
    return LexFloatResult{*basic, nanPayload, -NAN};
Member

Suggested change:
- return LexFloatResult{*basic, nanPayload, -NAN};
+ return LexFloatResult{*basic, {}, -NAN};

Can nanPayload ever be non-null here, given that the if above does an early return?

Member Author

No, it will always be nullopt. I can make this change in a follow-up.
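
Since the sign of a NaN returned by `strtod` for "-nan" is not reliable across platforms, the sign has to come from the token text. The sketch below shows one alternative way to do that with `std::copysign` instead of the `-NAN` expression used in the diff. The helper name `nanWithSignOf` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <string_view>

// Hypothetical helper: produce a quiet NaN whose sign bit follows the
// leading '-' of the token text, without relying on strtod for the sign.
double nanWithSignOf(std::string_view span) {
  bool negative = !span.empty() && span.front() == '-';
  return std::copysign(std::nan(""), negative ? -1.0 : 1.0);
}
```

`std::signbit` can then be used to check the sign, because ordinary comparisons against a NaN are always false.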

Comment on lines +820 to +822
// The payload if we lexed a nan with payload. We cannot store the payload
// directly in `d` because we do not know at this point whether we are parsing
// an f32 or f64 and therefore we do not know what the allowable payloads are.
Member

For non-NaN numbers, you store as a double conservatively and fix it later. Can't we do the same for NaNs?

Member Author

I thought about that, but the problem is that the default payload depends on the size of the float we are parsing. Following the spec's 2^(signif(N)-1) rule, it's 2^51 for f64 and 2^22 for f32. If we stored the payload directly in the double to be fixed up later, then we wouldn't be able to tell the difference between a custom payload of 2^51 that is out-of-bounds for an f32 and a default payload that was conservatively set to 2^51 and can be fixed up. We could fix that in different ways, like having a bool isDefaultPayload, but this seems simpler overall.

Comment on lines +723 to +740
{
  Lexer lexer("nan"sv);
  ASSERT_NE(lexer, lexer.end());
  Token expected{"nan"sv, FloatTok{{}, NAN}};
  EXPECT_EQ(*lexer, expected);
}
{
  Lexer lexer("+nan"sv);
  ASSERT_NE(lexer, lexer.end());
  Token expected{"+nan"sv, FloatTok{{}, NAN}};
  EXPECT_EQ(*lexer, expected);
}
{
  Lexer lexer("-nan"sv);
  ASSERT_NE(lexer, lexer.end());
  Token expected{"-nan"sv, FloatTok{{}, -NAN}};
  EXPECT_EQ(*lexer, expected);
}
Member

Do we not need payload for nan? The spec says 2^(signif(N)-1)

'nan' => nan(2^(signif(N)-1))

Member Author

Good observation. We will interpret a null payload to mean the default payload when d is NaN and inject that default payload in the same part of the parser (not implemented yet) where we will inject non-default payloads.

Member

I see. Might be helpful for reading to add a comment after this block:

if (ctx.takePrefix(":0x"sv)) {
  if (auto lexed = hexnum(ctx.next())) {
    ctx.take(*lexed);
    if (1 <= lexed->n && lexed->n < (1ull << 52)) {
      ctx.nanPayload = lexed->n;
    } else {
      // TODO: Add error production for invalid NaN payload.
      return {};
    }
  } else {
    // TODO: Add error production for malformed NaN payload.
    return {};
  }
}
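
The part of the parser that resolves a null payload is described in this thread as not yet implemented, so the following is only a sketch of how it might work. It uses the 2^(signif(N)-1) default quoted from the spec above, with signif being 23 for f32 and 52 for f64. `FloatWidth` and `resolvePayload` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical names, for illustration only.
enum class FloatWidth { F32, F64 };

constexpr unsigned signifBits(FloatWidth width) {
  return width == FloatWidth::F32 ? 23 : 52;
}

// Turn the lexed payload (null for a plain "nan") into the payload to inject
// for the given width, or nullopt if an explicit payload is out of range.
std::optional<std::uint64_t> resolvePayload(std::optional<std::uint64_t> lexed,
                                            FloatWidth width) {
  unsigned signif = signifBits(width);
  if (!lexed) {
    return 1ull << (signif - 1); // default payload: 2^(signif - 1)
  }
  if (*lexed == 0 || *lexed >= (1ull << signif)) {
    return std::nullopt; // explicit payload not allowed for this width
  }
  return lexed;
}
```

Because the explicit payload stays distinct from "no payload", a `nan:0x8000000000000` token can be rejected for f32 while a plain `nan` still gets the right default for either width.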

Comment on lines +808 to +825
{
  Lexer lexer("nan:0x0"sv);
  ASSERT_NE(lexer, lexer.end());
  Token expected{"nan:0x0"sv, KeywordTok{}};
  EXPECT_EQ(*lexer, expected);
}
{
  Lexer lexer("nan:0x10_0000_0000_0000"sv);
  ASSERT_NE(lexer, lexer.end());
  Token expected{"nan:0x10_0000_0000_0000"sv, KeywordTok{}};
  EXPECT_EQ(*lexer, expected);
}
{
  Lexer lexer("nan:0x1_0000_0000_0000_0000"sv);
  ASSERT_NE(lexer, lexer.end());
  Token expected{"nan:0x1_0000_0000_0000_0000"sv, KeywordTok{}};
  EXPECT_EQ(*lexer, expected);
}
Member

Don't these match the 'nan:0x' n:hexnum pattern?

Member Author

They would be, but n is outside the allowable bounds for f64 NaN payloads, so these fail to parse as floats. The lexer then tries to lex them as keywords, which succeeds. In the future once we have better error handling here, we should produce an error about the payload being out of bounds instead of falling back to parsing it as a keyword.

Member

We don't even allow 0x0... 0 is also outside the allowable bounds?

Member Author

That's right. Apparently if all the payload bits are zero, then the value is ±infinity rather than a NaN.
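
This can be checked directly on the bit pattern. The sketch below assumes the IEEE 754 binary64 layout, and its helper names are hypothetical. With the exponent bits all set, a zero payload gives an infinity and any nonzero payload gives a NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 64-bit pattern as a double.
double f64FromBits(std::uint64_t bits) {
  double d;
  static_assert(sizeof d == sizeof bits, "expects a 64-bit double");
  std::memcpy(&d, &bits, sizeof d);
  return d;
}

// Hypothetical helper: all exponent bits set, low 52 bits taken from payload.
double f64WithMaxExponent(bool negative, std::uint64_t payload) {
  assert(payload < (1ull << 52));
  std::uint64_t sign = negative ? (1ull << 63) : 0ull;
  return f64FromBits(sign | (0x7ffull << 52) | payload);
}
```

This is why the lexer's range check starts at 1 rather than 0.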

Comment on lines +706 to +713
{
  Lexer lexer("Inf"sv);
  EXPECT_EQ(lexer, lexer.end());
}
{
  Lexer lexer("INF"sv);
  EXPECT_EQ(lexer, lexer.end());
}
Member

We don't recognize these as keywords? The same for NaN/NAN.

Member Author

No, keywords can only start with lowercase letters 😆
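
The keyword rule in the text format is a lowercase letter followed by any number of idchars. A rough sketch of that check, with hypothetical function names and not the lexer's actual implementation, shows why `Inf` and `NaN` produce no token while an out-of-range `nan:0x0` can still fall back to a keyword.

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>

// idchar: digits, letters, and a fixed set of punctuation characters.
bool isIdChar(char c) {
  if (('0' <= c && c <= '9') || ('a' <= c && c <= 'z') ||
      ('A' <= c && c <= 'Z')) {
    return true;
  }
  constexpr std::string_view punctuation = "!#$%&'*+-./:<=>?@\\^_`|~";
  return punctuation.find(c) != std::string_view::npos;
}

// keyword: a lowercase letter followed by zero or more idchars.
bool isKeyword(std::string_view text) {
  if (text.empty() || text[0] < 'a' || text[0] > 'z') {
    return false;
  }
  return std::all_of(text.begin() + 1, text.end(), isIdChar);
}
```

Note that `inf` and `nan` also satisfy this rule. They are floats only because, as described earlier in the thread, the lexer tries the float grammar before falling back to keywords.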
