[Parser] Lex floating point values #4693

tlively · 2022-05-26T21:51:16Z

Rather than trying to actually implement the parsing of float values, which
cannot be done naively due to precision concerns, just parse the float grammar
then postprocess the parsed text into a form we can pass to strtod to do the
actual parsing of the value.
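
The postprocessing step can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `toDouble` is a hypothetical helper, and it assumes the text has already been validated against the float grammar, so the only work left is removing the `_` digit separators that `strtod` does not accept.

```cpp
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not from the PR). Assumes `text` already matched the
// float grammar; it only normalizes the text and delegates to strtod.
std::optional<double> toDouble(std::string_view text) {
  std::string cleaned;
  cleaned.reserve(text.size());
  for (char c : text) {
    if (c != '_') { // strtod does not understand digit separators
      cleaned.push_back(c);
    }
  }
  if (cleaned.empty()) {
    return std::nullopt;
  }
  char* last = nullptr;
  double d = std::strtod(cleaned.c_str(), &last);
  // Require that strtod consumed the whole cleaned string.
  if (last != cleaned.c_str() + cleaned.size()) {
    return std::nullopt;
  }
  return d;
}
```

Leaving the rounding to `strtod` is the point of the design: the lexer only has to recognize the shape of the literal.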

Since the float grammar reuses num and hexnum from the integer grammar but
does not care about overflow, add a mode to LexIntCtx, num, and hexnum to
allow parsing overflowing numbers.
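
As an illustration of what such a mode could look like, here is a small sketch of a decimal digit accumulator with a selectable overflow policy. The names `Overflow` and `accumulate` are assumptions made for this example and are not taken from the PR.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical overflow policy for the example.
enum class Overflow { Disallow, Ignore };

// Accumulates decimal digits. Returns nullopt for empty input, for a
// non-digit character, or on overflow when overflow is disallowed.
std::optional<uint64_t> accumulate(std::string_view digits, Overflow mode) {
  if (digits.empty()) {
    return std::nullopt;
  }
  uint64_t n = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return std::nullopt;
    }
    uint64_t digit = c - '0';
    // n * 10 + digit would exceed UINT64_MAX exactly when this holds.
    if (mode == Overflow::Disallow && n > (UINT64_MAX - digit) / 10) {
      return std::nullopt;
    }
    n = n * 10 + digit; // wraps silently in Ignore mode
  }
  return n;
}
```

In the float case the wrapped value would simply be discarded, since the text itself is what gets handed to `strtod`.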

For NaNs, store the payload as a separate value rather than as part of the
parsed double. The payload will be injected into the NaN at a higher level of
the parser once we know whether we are parsing an f64 or an f32 and therefore
know what the allowable payload values are.

tlively · 2022-05-26T21:51:31Z

Current dependencies on/for this PR:

main
- PR [Parser] Lex floating point values #4693 👈
  - PR [Parser][NFC] Clarify escaped string lexing #4694
    - PR [Parser][NFC] Create a public wat-lexer.h header #4695
      - PR [Parser][NFC] Remove extraneous braces from std::optional returns #4696

This comment was auto-generated by Graphite.

kripken · 2022-05-26T21:58:22Z

src/wasm/wat-parser-internal.h

+
+std::optional<int> getHexDigit(char c) {
+  if ('0' <= c && c <= '9') {
+    return {c - '0'};


Can't these be without the { }? iirc C++ will convert X to optional<X> for you.

I'll remove these and other unnecessary braces in a separate NFC PR.

Sounds good.
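
For reference, a brace-free version could look like the sketch below. Only the first branch appears in the diff above; the lowercase and uppercase letter branches are an assumed completion for illustration.

```cpp
#include <optional>

std::optional<int> getHexDigit(char c) {
  if ('0' <= c && c <= '9') {
    return c - '0'; // int converts implicitly to std::optional<int>
  }
  // The remaining branches are an assumption, not shown in the diff.
  if ('a' <= c && c <= 'f') {
    return 10 + c - 'a';
  }
  if ('A' <= c && c <= 'F') {
    return 10 + c - 'A';
  }
  return std::nullopt;
}
```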

kripken · 2022-05-26T22:02:09Z

src/wasm/wat-parser-internal.h

+    }
+    std::string str = ss.str();
+    char* last;
+    double d = std::strtod(str.data(), &last);


The existing code uses strtof for an f32, but I'm not sure if that's necessary...

There's no way to tell whether or not we are lexing an f32 or an f64 without higher level parser context, so we have to conservatively use double precision here. I don't think there are any problems that can arise from that.
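
A rough sketch of the two-stage approach described here, with a hypothetical `narrowToF32` standing in for the later parser stage that knows the target type. It only illustrates the flow and does not claim that the two paths agree for every possible literal.

```cpp
#include <cstdlib>

// Hypothetical later stage: the lexer kept a double, and the parser narrows
// it once it knows the literal is an f32.
float narrowToF32(double lexed) {
  return static_cast<float>(lexed);
}
```

For simple inputs such as `0.1` or `1.5`, narrowing the `strtod` result gives the same value as calling `strtof` directly.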

kripken · 2022-05-26T22:06:08Z

test/gtest/wat-parser.cpp

+    Lexer lexer("NaN"sv);
+    EXPECT_EQ(lexer, lexer.end());
+  }
+}


I'd say the spec tests may be good enough coverage for corner cases. But if you prefer to write out unit tests like this I'm not opposed.

We're a while off from being able to run spec tests with this new parser and the unit tests have been helpful for catching bugs.

kripken · 2022-05-26T23:14:10Z

test/gtest/wat-parser.cpp

+  {
+    Lexer lexer("nan:0x01"sv);
+    ASSERT_NE(lexer, lexer.end());
+    Token expected{"nan:0x01"sv, FloatTok{{1}, NAN}};


Do we not need these NaN payloads for some spec tests?

The token includes the NaN payload, just not as part of the NaN. The payload will be injected into the double value (or float value) at a higher level once we know whether we need an f32 or f64.

I see, sounds good.
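
Since the injection step is not part of this PR, the following is only a sketch of what it might look like for an f64. `withPayloadF64` and `mantissaF64` are hypothetical names, and the caller is assumed to have already checked that the payload is in range.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the low 52 (mantissa) bits of an f64.
uint64_t mantissaF64(double d) {
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof(bits));
  return bits & ((1ull << 52) - 1);
}

// Hypothetical helper: replaces the mantissa bits of an f64 NaN with
// `payload`, keeping the sign and exponent bits. Assumes the caller has
// checked 1 <= payload < 2^52.
double withPayloadF64(double nan, uint64_t payload) {
  const uint64_t mantissaMask = (1ull << 52) - 1;
  uint64_t bits;
  std::memcpy(&bits, &nan, sizeof(bits));
  bits = (bits & ~mantissaMask) | (payload & mantissaMask);
  double result;
  std::memcpy(&result, &bits, sizeof(result));
  return result;
}
```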

tlively · 2022-05-27T00:59:00Z

Graphite Merge Job

Current status: ✅ Merged

This pull request was successfully merged as part of a stack.

This comment was auto-generated by Graphite.

Job Reference: A9XO0urDPXHWLrhQjA6D

aheejin

I guess most of the questions stem from my lack of knowledge about floating point representation, but anyway..

aheejin · 2022-05-27T01:09:10Z

src/wasm/wat-parser-internal.h

+    }
+    if (nanPayload) {
+      double nan = basic->span[0] == '-' ? -NAN : NAN;
+      return LexFloatResult{*basic, nanPayload, nan};


Does this initialize base class members in order and then the child class members? Didn't know this was possible..

Yep, that's my understanding. I wouldn't have thought that this would work either, but apparently it does 🤷
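
This works because, since C++17, a class with public non-virtual base classes can still be an aggregate, and aggregate initialization fills in the base subobjects first, followed by the members in declaration order. A small sketch with simplified stand-in types (the `LexResult` definition here is an assumption, not the one in the PR):

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Simplified stand-ins for the PR's types.
struct LexResult {
  std::string_view span;
};

struct LexFloatResult : LexResult {
  std::optional<uint64_t> nanPayload;
  double d;
};

LexFloatResult makeResult(const LexResult& basic,
                          std::optional<uint64_t> payload,
                          double d) {
  // First initializer -> LexResult base, then nanPayload, then d.
  return LexFloatResult{basic, payload, d};
}
```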

aheejin · 2022-05-27T01:14:46Z

src/wasm/wat-parser-internal.h

+    }
+    // strtod does not return -NAN for "-nan" on all platforms.
+    if (basic->span == "-nan"sv) {
+      return LexFloatResult{*basic, nanPayload, -NAN};


Suggested change

-      return LexFloatResult{*basic, nanPayload, -NAN};
+      return LexFloatResult{*basic, {}, -NAN};

Can nanPayload ever be non-null here, given that the if above does an early return?

No, it will always be nullopt. I can make this change in a follow-up.
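
The idea of taking the NaN sign from the lexed text rather than from `strtod` can be sketched as below. `nanFromSpan` is a hypothetical helper, and it uses `std::copysign` instead of the `-NAN` expression from the diff to make the intended sign explicit.

```cpp
#include <cmath>
#include <string_view>

// Hypothetical helper: chooses the NaN sign from the lexed text, so the
// result does not depend on how the platform's strtod treats "-nan".
double nanFromSpan(std::string_view span) {
  bool negative = !span.empty() && span[0] == '-';
  return std::copysign(NAN, negative ? -1.0 : 1.0);
}
```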

aheejin · 2022-05-27T01:34:48Z

src/wasm/wat-parser-internal.h

+  // The payload if we lexed a nan with payload. We cannot store the payload
+  // directly in `d` because we do not know at this point whether we are parsing
+  // an f32 or f64 and therefore we do not know what the allowable payloads are.


For non-NaN numbers, you store as a double conservatively and fix it later. Can't we do the same for NaNs?

I thought about that, but the problem is that the default payload depends on the size of the float we are parsing. It's (2^52)-1 for f64 and (2^23)-1 for f32. If we stored the payload directly in the double to be fixed up later, then we wouldn't be able to tell the difference between a custom payload of (2^52)-1 that is out-of-bounds for an f32 and a default payload that was conservatively set to (2^52)-1 and can be fixed up. We could fix that in different ways, like having a bool isDefaultPayload, but this seems simpler overall.

aheejin · 2022-05-27T01:48:02Z

test/gtest/wat-parser.cpp

+  {
+    Lexer lexer("nan"sv);
+    ASSERT_NE(lexer, lexer.end());
+    Token expected{"nan"sv, FloatTok{{}, NAN}};
+    EXPECT_EQ(*lexer, expected);
+  }
+  {
+    Lexer lexer("+nan"sv);
+    ASSERT_NE(lexer, lexer.end());
+    Token expected{"+nan"sv, FloatTok{{}, NAN}};
+    EXPECT_EQ(*lexer, expected);
+  }
+  {
+    Lexer lexer("-nan"sv);
+    ASSERT_NE(lexer, lexer.end());
+    Token expected{"-nan"sv, FloatTok{{}, -NAN}};
+    EXPECT_EQ(*lexer, expected);
+  }


Do we not need payload for nan? The spec says 2^(signif(N)-1)

'nan' => nan(2^(signif(N)-1))

Good observation. We will interpret a null payload to mean the default payload when d is NaN and inject that default payload in the same part of the parser (not implemented yet) where we will inject non-default payloads.

I see. Might be helpful for reading to add a comment after this block:

binaryen/src/wasm/wat-parser-internal.h

Lines 602 to 615 in 590af86

    if (ctx.takePrefix(":0x"sv)) {
      if (auto lexed = hexnum(ctx.next())) {
        ctx.take(*lexed);
        if (1 <= lexed->n && lexed->n < (1ull << 52)) {
          ctx.nanPayload = lexed->n;
        } else {
          // TODO: Add error production for invalid NaN payload.
          return {};
        }
      } else {
        // TODO: Add error production for malformed NaN payload.
        return {};
      }
    }
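
The plan discussed in this thread (a null payload meaning "use the default", resolved once the float width is known) could be sketched like this. `resolvePayload` is a hypothetical function, not code from the PR, and the default follows the `2^(signif(N)-1)` rule quoted from the spec above.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper. `signifBits` is 23 for f32 and 52 for f64.
// Returns the payload to inject, or nullopt if the lexed payload is out of
// bounds for the requested width.
std::optional<uint64_t> resolvePayload(std::optional<uint64_t> lexed,
                                       unsigned signifBits) {
  if (!lexed) {
    // Plain `nan`: default payload per the quoted rule 2^(signif(N)-1).
    return 1ull << (signifBits - 1);
  }
  if (1 <= *lexed && *lexed < (1ull << signifBits)) {
    return *lexed;
  }
  return std::nullopt;
}
```

Keeping the lexed payload separate is what makes the last check possible: the same lexed value can be valid for an f64 and out of bounds for an f32.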

aheejin · 2022-05-27T01:52:05Z

test/gtest/wat-parser.cpp

+  {
+    Lexer lexer("nan:0x0"sv);
+    ASSERT_NE(lexer, lexer.end());
+    Token expected{"nan:0x0"sv, KeywordTok{}};
+    EXPECT_EQ(*lexer, expected);
+  }
+  {
+    Lexer lexer("nan:0x10_0000_0000_0000"sv);
+    ASSERT_NE(lexer, lexer.end());
+    Token expected{"nan:0x10_0000_0000_0000"sv, KeywordTok{}};
+    EXPECT_EQ(*lexer, expected);
+  }
+  {
+    Lexer lexer("nan:0x1_0000_0000_0000_0000"sv);
+    ASSERT_NE(lexer, lexer.end());
+    Token expected{"nan:0x1_0000_0000_0000_0000"sv, KeywordTok{}};
+    EXPECT_EQ(*lexer, expected);
+  }


Don't these match the 'nan:0x' n:hexnum pattern?

They would be, but n is outside the allowable bounds for f64 NaN payloads, so these fail to parse as floats. The lexer then tries to lex them as keywords, which succeeds. In the future once we have better error handling here, we should produce an error about the payload being out of bounds instead of falling back to parsing it as a keyword.

We don't even allow 0x0... 0 is also outside the allowable bounds?

That's right. Apparently if all the payload bits are zero, then the value is +-infinity rather than a NaN.
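
This can be checked directly on the bit pattern. The sketch below builds an f64 with all exponent bits set and a chosen mantissa; `fromMantissa` is a hypothetical helper used only for this illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: builds a positive f64 whose 11 exponent bits are all
// set and whose 52 mantissa bits are taken from `mantissa`.
double fromMantissa(uint64_t mantissa) {
  uint64_t bits = (0x7ffull << 52) | (mantissa & ((1ull << 52) - 1));
  double d;
  std::memcpy(&d, &bits, sizeof(d));
  return d;
}
```

With a zero mantissa the result is infinity, and with any nonzero mantissa it is a NaN, which is why the lower bound on the payload is 1.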

aheejin · 2022-05-27T01:53:07Z

test/gtest/wat-parser.cpp

+  {
+    Lexer lexer("Inf"sv);
+    EXPECT_EQ(lexer, lexer.end());
+  }
+  {
+    Lexer lexer("INF"sv);
+    EXPECT_EQ(lexer, lexer.end());
+  }


We don't recognize these as keywords? The same for NaN/NAN.

No, keywords can only start with lowercase letters 😆

tlively requested review from aheejin and kripken May 26, 2022 21:51

tlively mentioned this pull request May 26, 2022

[Parser][NFC] Clarify escaped string lexing #4694

Merged

kripken reviewed May 26, 2022

fixes

bcac92e

This was referenced May 26, 2022

[Parser][NFC] Create a public wat-lexer.h header #4695

Merged

[Parser][NFC] Remove extraneous braces from std::optional returns #4696

Merged

regular assert

d54d0e4

kripken approved these changes May 26, 2022

tlively merged commit 2dbc2b8 into main May 27, 2022

tlively deleted the lex-floats branch May 27, 2022 00:59

aheejin reviewed May 27, 2022
