Skip to content

Int128 Parsing support [PART 1]#11196

Closed
BlobCodes wants to merge 18 commits intocrystal-lang:masterfrom
BlobCodes:int128-parsing-part1
Closed

Int128 Parsing support [PART 1]#11196
BlobCodes wants to merge 18 commits intocrystal-lang:masterfrom
BlobCodes:int128-parsing-part1

Conversation

@BlobCodes
Copy link
Contributor

@BlobCodes BlobCodes commented Sep 9, 2021

This PR adds:

  • String#to_(u/i)128(?) methods
  • (U)Int128 literal parsing
  • (U)Int128 modulo/divide/multiply-overflow methods in compiler-rt (to allow this to even work)

With this you can do crazy things like:

$ bin/crystal eval 'puts 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF_u128 - 1'
Using compiled compiler at .build/crystal
340282366920938463463374607431768211454

This is a subset of #11111 only including those changes not making the CI fail on crystal v1.1.1.

TODO:

  • Find areas not supported on windows and put them behind flags

As far as I have read now, the int128 methods on MSVC require the use of SIMD (result expected to be placed in SSE register XMM0) - which crystal sadly does not explicitly support.

To not introduce breaking changes, the method deduce_integer_kind now uses the following priority:
Int32 > Int64 > UInt64 > Int128 > UInt128. I don't know if this should be changed.

Requires/Includes #11093
Related to #8373
Supersedes #10975
Closes #9516
Closes #7915
Related to #5545
Closes #11191

# Specs ported from compiler-rt

private def test__divti3(a : Int128, b : Int128, expected : Int128, file = __FILE__, line = __LINE__)
it "passes compiler-rt builtins unit tests" do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the same design as in the existing mulodi4_spec.cr, so don't feel obliged to change this. But I don't think there's much reason for having many individual examples (with non-descriptive names). I'd propose to place all related expecations in a single example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placing it all in the one example will show just the first fail instead of all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but that doesn't really matter much. They're all expected to pass anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course they are :D I'm saying is that you want to know exactly which ones are failing if they don't.

@straight-shoota
Copy link
Member

The symbols missing on win32 are part of compiler-rt but libgcc provides them es well, meaning they are always available as low-level implementations when linking with gcc. Apparently they are not available in MSVC. So I suppose we'll just have to implement them ourselves as well, not just for windows but any linker that doesn't provide them on its own.

@BlobCodes
Copy link
Contributor Author

I found a hack to return the lower64 bytes of a 128 integer in the compiler-rt funs.
Floats are returned using the xmm0 register, which is exactly where llvm expects its 128-bit integers.
Crystal of course still needs SIMD support for this to fully work on windows

@BlobCodes
Copy link
Contributor Author

Finally, all checks passed!

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Sep 11, 2021

Hmm.. I just found a breaking change..
Binary, octal and hexadecimal numbers containing underscores are not correctly parsed (throw an exception)
String.to_i does have a underscore option, but unline the crystal compiler, it does not support multiple sequential underscores.

Now I can either introduce a breaking change (which is probably not wanted), use String.delete('_') (which is probably slow) or write/duplicate a lot of code.. hmph..

BTW: Why is there no compiler spec for this?

@straight-shoota
Copy link
Member

I don't know why String#to_i fails on multiple sequential underscores. That actually seems like a rather arbitrary restriction. I certainly wouldn't have expected that, especially considering number literal syntax. I think they should really work the same way.
Changing the implementation of String#to_i to allow sequences of underscores is a technically a breaking change, but I think it's acceptable. Could even be seen as a bug fix if we consider the misalignment between String#to_i and number literals a bug.

@BlobCodes
Copy link
Contributor Author

I don't know why String#to_i fails on multiple sequential underscores. That actually seems like a rather arbitrary restriction. I certainly wouldn't have expected that, especially considering number literal syntax. I think they should really work the same way.
Changing the implementation of String#to_i to allow sequences of underscores is a technically a breaking change, but I think it's acceptable. Could even be seen as a bug fix if we consider the misalignment between String#to_i and number literals a bug.

The String#to_i implementation has a check built in just for this.

break if last_is_underscore

It should be very easy to change this behaviour.

@yxhuvud
Copy link
Contributor

yxhuvud commented Sep 12, 2021

I don't know why String#to_i fails on multiple sequential underscores.

I'm having a hard time thinking up a use case where multiple _ would be wanted, so why allow it?

However, it fails in more cases:

puts "1_1".to_i

OTOH,

puts 1_1
puts 1__1

prints

11
11

.
But Ruby also return 1 if you do "1__1".to_i
Perhaps because Ruby fails to parse 1__1.

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Sep 12, 2021

However, it fails in more cases:

puts "1_1".to_i
puts "1_1".to_i(underscore: true) #=> 11

I think it's okay to allow multiple sequential underscores, because you have to manually allow it anyways

@asterite
Copy link
Member

I don't think we should allow multiple consecutive underscores in numbers. If that's the case right now, it's a bug.

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Sep 12, 2021

Btw there's another thing about current integer parsing which I think is strange:

012 #=> Error: octal constants should be prefixed with 0o 
0_12 #=> 12

Is this also a bug or should this stay like it is?

@BlobCodes
Copy link
Contributor Author

Or this bug in number parsing:

-0_u64 #=> Invalid UInt32: -0 (ArgumentError); you've found a bug in the Crystal compiler.
-0u64 #=> 0

@straight-shoota
Copy link
Member

I've extracted the underscore discussion to #11203 to stay focused on the PR here.

The examples in the previous two posts all look like legitimate bugs to me. If this PR happens to fix them, that sounds good to me.

@beta-ziliani
Copy link
Member

Hi @BlobCodes, thanks for the hard work you put into this! If I may ask one more thing from you, would it be possible to break this PR even more, so we can look closely each bit? I'd say one PR for every bullet point in the description would be optimal. I can try to do it myself if you prefer. 🙏

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Sep 13, 2021

This new commit completely refactors number parsing in the lexer.
The methods scan_zero_number, scan_number, scan_npow2_number, deduce_integer_kind, check_integer_literal_fits_in_size, etc. have been merged into one method scan_number (which has exactly as many LOC as the original scan_number).

The lexer.cr in this PR is now 331 LOC lighter than the lexer.cr in master.

All bugs mentioned above have been fixed. Some new rules were created:

# Before
1_.1 #=> 1.1
1_e2 #=> 100.0
-0u64 #=> 0_u64
-0_u64 #=> Invalid UInt32: -0 (ArgumentError); you've found a bug in the Crystal compiler.
1__2 #=> 12
0x_2 #=> 2
0_12 #=> 12

# After
1_.1 #=> Error: trailing '_' in number
1_e2 #=> Error: trailing '_' in number
-0u64 #=> Error: Invalid negative value -0 for UInt64
-0_u64 #=> Error: Invalid negative value -0 for UInt64
1__2 #=> Error: trailing '_' in number
0x_2 #=> Error: numeric literal without digits
0_12 #=> Error: octal constants should be prefixed with 0o

The two new rules (and error messages) were taken from ruby.
If someone thinks one of these new rules is not good, please say so!

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Sep 13, 2021

Hi @BlobCodes, thanks for the hard work you put into this! If I may ask one more thing from you, would it be possible to break this PR even more, so we can look closely each bit? I'd say one PR for every bullet point in the description would be optimal. I can try to do it myself if you prefer. pray

Alright, I'll split it up.
These three steps however all depend on each other.. (lexer depends on string parsing, string parsing depends on compiler-rt methods)
Can I already create all three PRs or is it better to create them one after another?

@straight-shoota
Copy link
Member

Let's do one after the other. That's cleaner.

@straight-shoota
Copy link
Member

1_.1 #=> Error: trailing '_' in number
1_e2 #=> Error: trailing '_' in number
-0u64 #=> Error: Invalid negative value -0 for UInt64
-0_u64 #=> Error: Invalid negative value -0 for UInt64
1__2 #=> Error: trailing '_' in number
0x_2 #=> Error: numeric literal without digits
0_12 #=> Error: octal constants should be prefixed with 0o

I'm not sure these changes are actually correct. IMO underscores should be allowed at any place in a number literal, including around decimal separator and literal base prefix. I don't think there is any harm in doing that, but it could allow more versatile use cases. I see no reason to put unnecessary restrictions if the intention is clear and unambiguous.

In Rust, all those literals are valid except for the unsigned -0. In contrast to Crystal, Rust allows leading zeros, hence 0_12 is accepted (because 012 is).

@BlobCodes
Copy link
Contributor Author

1_.1 #=> Error: trailing '_' in number
1_e2 #=> Error: trailing '_' in number
-0u64 #=> Error: Invalid negative value -0 for UInt64
-0_u64 #=> Error: Invalid negative value -0 for UInt64
1__2 #=> Error: trailing '_' in number
0x_2 #=> Error: numeric literal without digits
0_12 #=> Error: octal constants should be prefixed with 0o

I'm not sure these changes are actually correct. IMO underscores should be allowed at any place in a number literal, including around decimal separator and literal base prefix. I don't think there is any harm in doing that, but it could allow more versatile use cases. I see no reason to put unnecessary restrictions if the intention is clear and unambiguous.

In Rust, all those literals are valid except for the unsigned -0. In contrast to Crystal, Rust allows leading zeros, hence 0_12 is accepted (because 012 is).

Hmm.. In ruby, those are all throwing errors.
As far as I have seen from the comments above, I think "1__2" should definitely be disallowed.

BlobCodes added a commit to BlobCodes/crystal that referenced this pull request Sep 13, 2021
BlobCodes added a commit to BlobCodes/crystal that referenced this pull request Sep 13, 2021
@straight-shoota
Copy link
Member

(0_12 actually works in Ruby, it's interpreted as an octal literal (same as 012). Interestingly, 0o_12 is an error though.)

I don't see any convincing argument why the literals 1_.1, 1_e2, 1__2 and 0x_2 should become errors. If we were starting from scratch, it wouldn't be a different discussions. But they are currently valid and part of the language. I don't think there's enough benefit to justify the costs of changing that. Especially not as a incidental byproduct of a feature addition.

If somebody wants to propose such a change, they should start a dedicated discussion about that. But let's not mingle it with 128-bit support.

@Sija
Copy link
Contributor

Sija commented Sep 13, 2021

They certainly look like errors, are super-rarely used - if at all, are undocumented, so I don't get why shouldn't they be made errors - which they are - according to the non-existing specs and documentation - i.e. they were never designed to work that way.

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Sep 13, 2021

But let's not mingle it with 128-bit support.

Hmm.. yeah, that's fair. But allowing int128 parsing requires a lexer refactor anyways, so I think there should be a discussion about this (because it's not that much added work).

@BlobCodes
Copy link
Contributor Author

IMO underscores should be allowed at any place in a number literal, including around decimal separator and literal base prefix

"including around decimal separator"
The number "1._1" would currently be recognized as calling the method _1 on the int32 1.
I think it should not be allowed to put an underscore before a decimal seperator too - that's just inconsistent


With String.to_i already raising when an integer has multiple underscores, I think this should be considered an error in the lexer too (Ary even called it a bug).
Searching on github for two sequential underscores returns zero number literals - nobody would actually do something like that


I don't see any convincing argument why the literals 1_.1, 1_e2, 1__2 and 0x_2 should become errors

I can kind of understand that you think 1_e2 or 0x_2 shouldn't become errors, but I think this is worthy of discussion


Anyways, the lexer refactor has been seperated into #11211 - let's continue the discussions there

@oprypin
Copy link
Member

oprypin commented Oct 25, 2021

Is this PR still usable or was it superseded? Should it be closed?

@straight-shoota
Copy link
Member

It should have been entirely superseded by #11206, #11245 and #11211 (the latter is still pending).

@BlobCodes
Copy link
Contributor Author

Is this PR still usable or was it superseded? Should it be closed?

There are still some changes in this PR that are not included anywhere else.
The lexer refactor in #11211 does not include 128-bit support, it only tidies everything up and removes the need for a fixed-size integer, so int128 support can easily be implemented (as well as any other integer size).
I could include the 128-bit support there, but I don't know if that would be more work for the maintainers.

Also, #11211 includes some bug fixes to integer parsing and some opinionated changes which still need to be discussed in #11203 and #11214

@BlobCodes
Copy link
Contributor Author

Replaced by #11571

@BlobCodes BlobCodes closed this Dec 11, 2021
@BlobCodes BlobCodes deleted the int128-parsing-part1 branch January 29, 2022 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

8 participants