Int128 Parsing support [PART 1] by BlobCodes · Pull Request #11196 · crystal-lang/crystal

BlobCodes · 2021-09-09T22:12:47Z

This PR adds:

String#to_(u/i)128(?) methods
(U)Int128 literal parsing
(U)Int128 modulo/divide/multiply-overflow methods in compiler-rt (to allow this to even work)
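For readers unfamiliar with the "multiply-overflow" helpers mentioned in the list above, the following is a minimal Python model of what such a helper is expected to compute for signed 128-bit operands: the wrapped product plus a flag that says whether the exact product fit. The function name `mul_overflow_i128` is hypothetical, and this is only a sketch of the semantics, not the compiler-rt or Crystal implementation.

```python
# Signed 128-bit bounds.
I128_MIN = -(1 << 127)
I128_MAX = (1 << 127) - 1


def mul_overflow_i128(a: int, b: int) -> tuple:
    """Return (wrapped_product, overflowed) for signed 128-bit operands.

    Hypothetical helper: Python integers are arbitrary precision, so the
    exact product is computed first and then wrapped into the
    two's-complement 128-bit range.
    """
    exact = a * b
    wrapped = (exact + (1 << 127)) % (1 << 128) - (1 << 127)
    return wrapped, exact != wrapped


print(mul_overflow_i128(2, 3))          # (6, False)
print(mul_overflow_i128(I128_MAX, 2))   # (-2, True)
```

The overflow flag is what lets a caller raise an overflow error instead of silently returning the wrapped value.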

With this you can do crazy things like:

$ bin/crystal eval 'puts 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF_u128 - 1'
Using compiled compiler at .build/crystal
340282366920938463463374607431768211454
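The printed value can be checked independently. The short Python sketch below (Python is used only because its integers are arbitrary precision) confirms that the literal with 32 hexadecimal `F` digits equals 2^128 - 1 and that subtracting 1 gives the number shown above.

```python
# Largest unsigned 128-bit value: 2**128 - 1.
U128_MAX = (1 << 128) - 1

# The literal from the example: 32 hex digits, all F.
literal = int("F" * 32, 16)

result = literal - 1
print(literal == U128_MAX)  # True
print(result)               # 340282366920938463463374607431768211454
```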

This is a subset of #11111 only including those changes not making the CI fail on crystal v1.1.1.

TODO:

Find areas not supported on windows and put them behind flags

As far as I have read now, the int128 methods on MSVC require the use of SIMD (result expected to be placed in SSE register XMM0) - which crystal sadly does not explicitly support.

To not introduce breaking changes, the method deduce_integer_kind now uses the following priority:
Int32 > Int64 > UInt64 > Int128 > UInt128. I don't know if this should be changed.
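The priority order above can be read as "pick the first type in the list whose range contains the value". A minimal Python sketch of that reading follows; the name `deduce_kind` and the table layout are hypothetical and are not taken from the lexer's actual `deduce_integer_kind`.

```python
# Candidate types in the priority order quoted above: (name, min, max).
KINDS = [
    ("Int32", -(1 << 31), (1 << 31) - 1),
    ("Int64", -(1 << 63), (1 << 63) - 1),
    ("UInt64", 0, (1 << 64) - 1),
    ("Int128", -(1 << 127), (1 << 127) - 1),
    ("UInt128", 0, (1 << 128) - 1),
]


def deduce_kind(value: int) -> str:
    """Return the first type, in priority order, whose range holds value."""
    for name, lo, hi in KINDS:
        if lo <= value <= hi:
            return name
    raise ValueError(f"{value} doesn't fit in any supported integer type")


print(deduce_kind(1))        # Int32
print(deduce_kind(1 << 63))  # UInt64
print(deduce_kind(1 << 64))  # Int128
```

Keeping UInt64 ahead of Int128 means that, under this reading, literals that already resolved to UInt64 before the change keep their type, which is the non-breaking property the description asks for.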

Requires/Includes #11093
Related to #8373
Supersedes #10975
Closes #9516
Closes #7915
Related to #5545
Closes #11191

straight-shoota · 2021-09-10T10:13:58Z

spec/std/crystal/compiler_rt/divmod128_spec.cr

+# Specs ported from compiler-rt
+
+private def test__divti3(a : Int128, b : Int128, expected : Int128, file = __FILE__, line = __LINE__)
+  it "passes compiler-rt builtins unit tests" do


I know this is the same design as in the existing mulodi4_spec.cr, so don't feel obliged to change this. But I don't think there's much reason for having many individual examples (with non-descriptive names). I'd propose to place all related expectations in a single example.

Placing it all in the one example will show just the first fail instead of all.

Sure, but that doesn't really matter much. They're all expected to pass anyways.

Of course they are :D What I'm saying is that you want to know exactly which ones are failing if they don't.


straight-shoota · 2021-09-10T13:25:56Z

The symbols missing on win32 are part of compiler-rt, but libgcc provides them as well, meaning they are always available as low-level implementations when linking with gcc. Apparently they are not available in MSVC. So I suppose we'll just have to implement them ourselves as well, not just for Windows but for any linker that doesn't provide them on its own.

BlobCodes · 2021-09-10T17:13:26Z

I found a hack to return the lower 64 bits of a 128-bit integer in the compiler-rt funs.
Floats are returned using the xmm0 register, which is exactly where LLVM expects its 128-bit integers.
Crystal of course still needs SIMD support for this to fully work on Windows.

BlobCodes · 2021-09-11T14:28:00Z

Finally, all checks passed!

BlobCodes · 2021-09-11T23:21:37Z

Hmm.. I just found a breaking change..
Binary, octal and hexadecimal numbers containing underscores are not parsed correctly (they throw an exception).
String#to_i does have an underscore option, but unlike the Crystal compiler, it does not support multiple sequential underscores.

Now I can either introduce a breaking change (which is probably not wanted), use String#delete('_') (which is probably slow) or write/duplicate a lot of code.. hmph..

BTW: Why is there no compiler spec for this?

straight-shoota · 2021-09-12T09:23:14Z

I don't know why String#to_i fails on multiple sequential underscores. That actually seems like a rather arbitrary restriction. I certainly wouldn't have expected that, especially considering number literal syntax. I think they should really work the same way.
Changing the implementation of String#to_i to allow sequences of underscores is technically a breaking change, but I think it's acceptable. It could even be seen as a bug fix if we consider the misalignment between String#to_i and number literals a bug.


BlobCodes · 2021-09-12T12:51:16Z

> I don't know why String#to_i fails on multiple sequential underscores. That actually seems like a rather arbitrary restriction. I certainly wouldn't have expected that, especially considering number literal syntax. I think they should really work the same way.
> Changing the implementation of String#to_i to allow sequences of underscores is technically a breaking change, but I think it's acceptable. It could even be seen as a bug fix if we consider the misalignment between String#to_i and number literals a bug.

The String#to_i implementation has a check built in just for this.

crystal/src/string.cr

Line 595 in a067d06

break if last_is_underscore

It should be very easy to change this behaviour.
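To make the effect of that check concrete, here is a simplified Python model of a decimal parser with an opt-in underscore flag. It returns `None` where parsing fails. The function `parse_int` is hypothetical and only mirrors the idea of the quoted `last_is_underscore` check. Rejecting leading and trailing underscores is an assumption of this sketch, not something confirmed by the thread.

```python
from typing import Optional


def parse_int(text: str, underscore: bool = False) -> Optional[int]:
    """Parse a non-negative decimal string, or return None if invalid."""
    value = 0
    seen_digit = False
    last_is_underscore = False
    for ch in text:
        if ch == "_" and underscore:
            # Models the quoted check: a second '_' in a row ends parsing.
            # Assumption of this sketch: a leading '_' is rejected as well.
            if last_is_underscore or not seen_digit:
                return None
            last_is_underscore = True
            continue
        if not ("0" <= ch <= "9"):
            return None
        value = value * 10 + int(ch)
        seen_digit = True
        last_is_underscore = False
    # Assumption of this sketch: a trailing '_' is rejected too.
    if not seen_digit or last_is_underscore:
        return None
    return value


print(parse_int("1_1"))                    # None
print(parse_int("1_1", underscore=True))   # 11
print(parse_int("1__1", underscore=True))  # None
```

Allowing consecutive underscores in this model would amount to dropping the `last_is_underscore` part of the first condition.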

yxhuvud · 2021-09-12T13:02:42Z

> I don't know why String#to_i fails on multiple sequential underscores.

I'm having a hard time thinking up a use case where multiple _ would be wanted, so why allow it?

However, it fails in more cases:

puts "1_1".to_i

OTOH,

puts 1_1
puts 1__1

prints

11
11

But Ruby also returns 1 if you do "1__1".to_i.
Perhaps because Ruby fails to parse 1__1.

BlobCodes · 2021-09-12T13:05:18Z

> However, it fails in more cases:
> puts "1_1".to_i

puts "1_1".to_i(underscore: true) #=> 11

I think it's okay to allow multiple sequential underscores, because you have to manually allow it anyways

asterite · 2021-09-12T13:29:36Z

I don't think we should allow multiple consecutive underscores in numbers. If that's the case right now, it's a bug.

BlobCodes · 2021-09-12T13:36:29Z

Btw there's another thing about current integer parsing which I think is strange:

012 #=> Error: octal constants should be prefixed with 0o 
0_12 #=> 12

Is this also a bug or should this stay like it is?

BlobCodes · 2021-09-12T13:56:58Z

Or this bug in number parsing:

-0_u64 #=> Invalid UInt32: -0 (ArgumentError); you've found a bug in the Crystal compiler.
-0u64 #=> 0

straight-shoota · 2021-09-12T15:06:01Z

I've extracted the underscore discussion to #11203 to stay focused on the PR here.

The examples in the previous two posts all look like legitimate bugs to me. If this PR happens to fix them, that sounds good to me.

beta-ziliani · 2021-09-13T13:32:52Z

Hi @BlobCodes, thanks for the hard work you put into this! If I may ask one more thing from you, would it be possible to break this PR up even more, so we can look closely at each bit? I'd say one PR for every bullet point in the description would be optimal. I can try to do it myself if you prefer. 🙏

BlobCodes · 2021-09-13T16:18:46Z

This new commit completely refactors number parsing in the lexer.
The methods scan_zero_number, scan_number, scan_npow2_number, deduce_integer_kind, check_integer_literal_fits_in_size, etc. have been merged into one method scan_number (which has exactly as many LOC as the original scan_number).

The lexer.cr in this PR is now 331 LOC lighter than the lexer.cr in master.

All bugs mentioned above have been fixed. Some new rules were created:

# Before
1_.1 #=> 1.1
1_e2 #=> 100.0
-0u64 #=> 0_u64
-0_u64 #=> Invalid UInt32: -0 (ArgumentError); you've found a bug in the Crystal compiler.
1__2 #=> 12
0x_2 #=> 2
0_12 #=> 12

# After
1_.1 #=> Error: trailing '_' in number
1_e2 #=> Error: trailing '_' in number
-0u64 #=> Error: Invalid negative value -0 for UInt64
-0_u64 #=> Error: Invalid negative value -0 for UInt64
1__2 #=> Error: trailing '_' in number
0x_2 #=> Error: numeric literal without digits
0_12 #=> Error: octal constants should be prefixed with 0o

The two new rules (and error messages) were taken from ruby.
If someone thinks one of these new rules is not good, please say so!
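The rules behind the "After" column can be approximated with a small checker. The Python sketch below is inferred only from the seven examples listed above; the function `check_literal` and its rule set are hypothetical and much simpler than the real lexer. In particular, it treats every prefixed literal as if it used hexadecimal digits.

```python
import re
from typing import Optional

DIGITS = set("0123456789")
HEX_DIGITS = set("0123456789abcdefABCDEF")
SUFFIX_START = set("uif")  # first letter of a type suffix such as u64, i32, f64


def check_literal(literal: str) -> Optional[str]:
    """Return an error message for a rejected literal, or None if accepted."""
    negative = literal.startswith("-")
    body = literal[1:] if negative else literal

    # Rule: a negative value cannot carry an unsigned suffix.
    unsigned = re.search(r"u(8|16|32|64|128)$", body)
    if negative and unsigned:
        digits = body[: unsigned.start()].rstrip("_")
        return f"Invalid negative value -{digits} for UInt{unsigned.group(1)}"

    prefixed = body[:2] in ("0x", "0o", "0b")
    if prefixed:
        # Rule: a base prefix must be followed directly by a digit.
        if body[2:3] not in HEX_DIGITS:
            return "numeric literal without digits"
    elif re.match(r"0_?[0-9]", body):
        # Rule: a leading zero is not hidden by an underscore.
        return "octal constants should be prefixed with 0o"

    # Rule: each '_' must be followed by a digit or by a type suffix.
    allowed = (HEX_DIGITS if prefixed else DIGITS) | SUFFIX_START
    for i, ch in enumerate(body):
        if ch == "_" and body[i + 1 : i + 2] not in allowed:
            return "trailing '_' in number"
    return None


for lit in ["1_.1", "1_e2", "-0u64", "-0_u64", "1__2", "0x_2", "0_12", "1_000"]:
    print(lit, "=>", check_literal(lit))
```

Run against the seven literals above, this sketch reproduces the messages in the "After" column, and it accepts an ordinary literal such as `1_000`.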

BlobCodes · 2021-09-13T16:20:32Z

> Hi @BlobCodes, thanks for the hard work you put into this! If I may ask one more thing from you, would it be possible to break this PR up even more, so we can look closely at each bit? I'd say one PR for every bullet point in the description would be optimal. I can try to do it myself if you prefer. 🙏

Alright, I'll split it up.
These three steps however all depend on each other.. (lexer depends on string parsing, string parsing depends on compiler-rt methods)
Can I already create all three PRs or is it better to create them one after another?

straight-shoota · 2021-09-13T16:29:01Z

Let's do one after the other. That's cleaner.

straight-shoota · 2021-09-13T16:43:12Z

> 1_.1 #=> Error: trailing '_' in number
> 1_e2 #=> Error: trailing '_' in number
> -0u64 #=> Error: Invalid negative value -0 for UInt64
> -0_u64 #=> Error: Invalid negative value -0 for UInt64
> 1__2 #=> Error: trailing '_' in number
> 0x_2 #=> Error: numeric literal without digits
> 0_12 #=> Error: octal constants should be prefixed with 0o

I'm not sure these changes are actually correct. IMO underscores should be allowed at any place in a number literal, including around decimal separator and literal base prefix. I don't think there is any harm in doing that, but it could allow more versatile use cases. I see no reason to put unnecessary restrictions if the intention is clear and unambiguous.

In Rust, all those literals are valid except for the unsigned -0. In contrast to Crystal, Rust allows leading zeros, hence 0_12 is accepted (because 012 is).

BlobCodes · 2021-09-13T16:58:27Z

> 1_.1 #=> Error: trailing '_' in number
> 1_e2 #=> Error: trailing '_' in number
> -0u64 #=> Error: Invalid negative value -0 for UInt64
> -0_u64 #=> Error: Invalid negative value -0 for UInt64
> 1__2 #=> Error: trailing '_' in number
> 0x_2 #=> Error: numeric literal without digits
> 0_12 #=> Error: octal constants should be prefixed with 0o
>
> I'm not sure these changes are actually correct. IMO underscores should be allowed at any place in a number literal, including around decimal separator and literal base prefix. I don't think there is any harm in doing that, but it could allow more versatile use cases. I see no reason to put unnecessary restrictions if the intention is clear and unambiguous.
>
> In Rust, all those literals are valid except for the unsigned -0. In contrast to Crystal, Rust allows leading zeros, hence 0_12 is accepted (because 012 is).

Hmm.. In ruby, those are all throwing errors.
As far as I have seen from the comments above, I think "1__2" should definitely be disallowed.

straight-shoota · 2021-09-13T17:35:53Z

(0_12 actually works in Ruby, it's interpreted as an octal literal (same as 012). Interestingly, 0o_12 is an error though.)

I don't see any convincing argument why the literals 1_.1, 1_e2, 1__2 and 0x_2 should become errors. If we were starting from scratch, it would be a different discussion. But they are currently valid and part of the language. I don't think there's enough benefit to justify the costs of changing that. Especially not as an incidental byproduct of a feature addition.

If somebody wants to propose such a change, they should start a dedicated discussion about that. But let's not mingle it with 128-bit support.

Sija · 2021-09-13T17:40:42Z

They certainly look like errors, are super-rarely used - if at all, are undocumented, so I don't get why shouldn't they be made errors - which they are - according to the non-existing specs and documentation - i.e. they were never designed to work that way.

BlobCodes · 2021-09-13T17:44:27Z

> But let's not mingle it with 128-bit support.

Hmm.. yeah, that's fair. But allowing int128 parsing requires a lexer refactor anyways, so I think there should be a discussion about this (because it's not that much added work).

BlobCodes · 2021-09-14T22:01:14Z

> IMO underscores should be allowed at any place in a number literal, including around decimal separator and literal base prefix

"including around decimal separator"
The number "1._1" would currently be recognized as calling the method _1 on the Int32 1.
I think it should not be allowed to put an underscore before a decimal separator either - that's just inconsistent.

With String#to_i already raising when an integer has multiple consecutive underscores, I think this should be considered an error in the lexer too (Ary even called it a bug).
Searching GitHub for two sequential underscores returns zero number literals - nobody would actually do something like that.

> I don't see any convincing argument why the literals 1_.1, 1_e2, 1__2 and 0x_2 should become errors

I can kind of understand that you think 1_e2 or 0x_2 shouldn't become errors, but I think this is worthy of discussion.

Anyways, the lexer refactor has been separated into #11211 - let's continue the discussions there.

oprypin · 2021-10-25T03:13:16Z

Is this PR still usable or was it superseded? Should it be closed?

straight-shoota · 2021-10-25T09:07:25Z

It should have been entirely superseded by #11206, #11245 and #11211 (the latter is still pending).

BlobCodes · 2021-10-25T15:57:47Z

> Is this PR still usable or was it superseded? Should it be closed?

There are still some changes in this PR that are not included anywhere else.
The lexer refactor in #11211 does not include 128-bit support, it only tidies everything up and removes the need for a fixed-size integer, so int128 support can easily be implemented (as well as any other integer size).
I could include the 128-bit support there, but I don't know if that would be more work for the maintainers.

Also, #11211 includes some bug fixes to integer parsing and some opinionated changes which still need to be discussed in #11203 and #11214

BlobCodes · 2021-12-11T14:42:25Z

Replaced by #11571

BlobCodes added 2 commits September 10, 2021 00:04

Add int128 parsing, different compiler_rt methods and specs

fdb52dc

Crystal tool format

8e001b9

straight-shoota reviewed Sep 10, 2021

straight-shoota added kind:feature topic:compiler:parser topic:stdlib:numeric labels Sep 10, 2021

BlobCodes and others added 5 commits September 10, 2021 13:13

Update spec/std/int_spec.cr

21a3421

Include U/Int128 popcount spec Co-authored-by: Johannes Müller <straightshoota@gmail.com>

Make specs compile on windows

1254065

Fix specs (Int128::MIN is already negative)

0a77b75

Add source of mulo{tds}i implementation

6adb83e

Crystal tool format

3b3634d

Sija reviewed Sep 10, 2021

src/crystal/compiler_rt/mul.cr

BlobCodes added 8 commits September 11, 2021 13:00

Get int signness from type name

64187a2

Change order of u/int suffix consumption

2b1ba7b

Assign string_value directly from case

c5e1279

crystal tool format

3c7d7a1

Make std_specs work on windows

c8378b7

remove nested it's in string_spec

dae48fb

Try to fix missing symbols error in win32 compiler specs

fd8a77e

Fix arithmetics_spec constants on win32

e3e8458

BlobCodes mentioned this pull request Sep 11, 2021

Support U/Int128 parsing and U/Int128 literals #11111

Closed

Sija reviewed Sep 12, 2021

spec/std/int_spec.cr

straight-shoota mentioned this pull request Sep 12, 2021

Consecutive underscores in number literals and number parsing methods #11203

Closed

Completely refactor lexer number parsing

b91eb8c

BlobCodes added a commit to BlobCodes/crystal that referenced this pull request Sep 13, 2021

compiler-rt methods from crystal-lang#11196

a0b1764

BlobCodes added a commit to BlobCodes/crystal that referenced this pull request Sep 13, 2021

specs from crystal-lang#11196

1aa17f7

BlobCodes mentioned this pull request Sep 13, 2021

Int128 compiler-rt methods (Int128 literal support part 1) #11206

Merged

BlobCodes added 2 commits September 13, 2021 19:59

set token raw in lexer number parsing

f64d70a

Fix CI (remove a fix that didn't fix anything)

5e0fd92

BlobCodes mentioned this pull request Sep 14, 2021

Lexer number parsing refactor #11211

Merged

straight-shoota mentioned this pull request Sep 15, 2021

Underscores in number literals #11214

Closed

BlobCodes mentioned this pull request Dec 11, 2021

Implement lexer int128 support #11571

Merged

BlobCodes closed this Dec 11, 2021

BlobCodes deleted the int128-parsing-part1 branch January 29, 2022 22:39
