Improve Numeric matching to support full range of float64 #188
Conversation
LGTM, a couple of nonblocking comments to look at
* {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
* where parsing directly to {@code Double} can loose precision when parsing doubles. It's |
"loose" should be "lose"
* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
* where parsing directly to {@code Double} can loose precision when parsing doubles. It's
* probably possible to wider ranges with our current implementation of parsing strings to |
grammar, "possible to wider ranges"
throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
" and " + ComparableNumber.HALF_TRILLION + ", inclusive");
// make sure we have the comparable numbers and haven't eaten up decimals values
if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) || |
Can this happen? JSON doesn't allow numeric values of NaN and Inf
Not in JSON, but there's interest in supporting other formats, so I'm bullet-proofing this now rather than being in for a surprise in the future.
You can check both in one go based on the bits representation. With a constant long NOT_FINITE = 0x7ffL << 52, the expression (bits & NOT_FINITE) == NOT_FINITE will be true for only NaN and +/- Infinity.
Maybe the JVM automagically already inlines the two calls like that. If not, it could help, here.
The JVM does inline it. I've left this as is for readability for now.
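For reference, a minimal sketch of the combined check described above (illustrative only, not part of the PR):

```java
// Sketch only: equivalent to Double.isNaN(d) || Double.isInfinite(d).
// A double's exponent field (bits 52-62) is all ones exactly for NaN and +/- Infinity,
// so masking with 0x7ffL << 52 (0x7FF0000000000000L) and comparing against the mask
// catches both cases in a single test.
private static final long NOT_FINITE = 0x7ffL << 52;

static boolean isNotFinite(double d) {
    final long bits = Double.doubleToRawLongBits(d);
    return (bits & NOT_FINITE) == NOT_FINITE;
}
```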
* <br/>
* As shown in Quamina's numbits.go, it's possible to use variable length encoding
* to reduce storage for simple (often common) numbers but it's not done here to
* keep range comparisons simple for now. |
Haha, ever since Quamina got variable-width numbits I've been worrying about the range comparison code. Sigh, now I guess I have to figure that out.
Yeah. I gave it a shot but found it hard to get right. I wanted to get back to building $and support, so took the easier route. I figured I can come back to cleaning up my branch later.
* @param value the long value to be converted
* @return the Base128 encoded string representation of the input value
*/
public static String numbits(long value) { |
Each char in a Java String is 2 bytes because they are UTF-16 code units. You could save a lot of memory if you could work on the UTF-8 bytes. But I realize that this comment applies to all of Ruler, I guess.
that's fair. It would be worth a look when we dig into memory & cpu improvements down the line.
How are Strings-by-chars sorted in Event Ruler? Binary value or with some collation? If it's binary... you can still use the same scheme but store 2 7-bit bytes per char. Might require shuffling the bytes around if the endianness of the char 2-byte-tuples is different. But as long as leaving the 0x8080 bits out stays clear of combiners or possibly causing illegal codepoints... you should be good.
... or use the lower 15 bits; simpler than a 0x8080 mask. Still, check that you get by with only one codepoint and that it's not reserved or illegal in UTF16.
Comparing ranges got trickier when using 2 7-bit bytes per char, so I've parked the idea for now. Will come back to it later.
In case anyone takes a stab at it in the future, here's the reference code for this:
static final int MAX_LENGTH_IN_BYTES = 5;
...
public static String numbits(long value) {
    // pack two 7-bit groups into each char, filling from the least significant end
    char[] result = new char[MAX_LENGTH_IN_BYTES];
    int index = MAX_LENGTH_IN_BYTES - 1;
    while (value != 0 && index >= 0) {
        int lowBits = (int) (value & BASE_128_BITMASK);   // low 7 bits
        value >>= 7;
        int highBits = (int) (value & BASE_128_BITMASK);  // next 7 bits
        value >>= 7;
        result[index--] = (char) ((highBits << 7) | lowBits);
    }
    // chars that were never written stay '\u0000', acting as leading zero digits
    return new String(result);
}
... Tim told me Quamina internally works differently. Quamina needs the UTF8 validity because it uses UTF8-illegal byte values as markers.
Is Ruler different enough so you don't depend on that? In that case, you could also just cut it into 4 different shorts. Unless that destroys some string operations you need, e.g. by producing hanging combiners etc.
For range comparison, Ruler has a lookup table that changes incoming numbers into a sequence of digits: https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L128 . Changing this requires more effort and I suspect it's not as beneficial as supporting variable length for numbers. So I chose to work on variable length first and then come back to compressing the numbers later (though I'm hopeful that I'll be able to do both once I get further into refactoring this part of the code).
Quamina needs the UTF8 validity because it uses UTF8-illegal byte values as markers.
Is ruler different enough so you don't depend on that
I don't know the answer to this off the top of my head. Will check on it later, but I was originally suspecting it should be fine as long as I address any edge cases within the look-up table part of the code.
@baldawar/@timbray my 2cts as promised... ping me if you need more details
* 1. It saves 3 bytes of memory per number compared to decimal representation.
* 2. It is lexicographically comparable, which is useful for maintaining sorted order of numbers.
* 2. It aligns with the radix used for IP addresses.
* We use Base128 encoding offers a compact representation of decimal numbers |
s/We use/The used/
* other data types. The higher the number, the lower the accuracy that can be maintained. For example,
* {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue |
Can you provide a reference or details for that?
Double handling is finicky and has lots of potential foot guns. But I never ran into a loss of precision that's not mediated by decimal-binary conversion issues. If you e.g. use Double.toHexString and the "p" representation of exponents in your numerical tests, you should be able to avoid these issues.
This only applies to where you want to use numbers and not the strings, of course. But at some point, the JSON number strings will be converted to double and you'll run into the same issue.
Unless the JSON parsers also avoid Double. But then, you should use the same code path the JSON parser uses.
You won't lose fidelity. But you'll gain some performance.
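A small sketch of the hex-float suggestion above (illustrative only, not from the PR); the "p" form round-trips exactly, so test assertions on it avoid decimal-to-binary conversion noise:

```java
public class HexFloatDemo {
    public static void main(String[] args) {
        double d = 0.30d - 0.10d;
        System.out.println(d);                        // 0.19999999999999998
        System.out.println(Double.toHexString(d));    // 0x1.9999999999999p-3
        System.out.println(Double.toHexString(0.2d)); // 0x1.999999999999ap-3, one ulp above d
        // hex float strings parse back to exactly the same double
        System.out.println(Double.parseDouble("0x1.999999999999ap-3") == 0.2d); // true
    }
}
```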
I noticed the issue with 37.807807921694092 when parsing Doubles from strings (so Double.valueOf(...)) and when doing double y = 37.807807921694092. Both return 37.80780792169409, so it's probably tied to the decimal-binary conversion problems. I wasn't sure whether JSON number strings will have the same issue, as some systems favor direct string transformations and bypass JSON parsing altogether.
A double represented as a decimal stops being accurate after 15 digits (unless it ends in 5). This number has 17.
A good way to keep it in mind: 10 bits fit 3 decimal digits (1000 < 1024). A double has 52 mantissa bits, so this buys you a fractional amount more than 15 digits with 50 bits. The other 2 (or 3 if you include the implicit one) are not enough for a 16th digit. So it will... distribute... the rest, a fraction more than 8 alternatives, over 10 candidate digits. As powers of two are binary, a number ending in 5 will be at an advantage here.
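A quick, runnable illustration of the points in this thread (the printed values are the ones quoted above); BigDecimal appears here because the PR parses incoming numbers as BigDecimal first:

```java
import java.math.BigDecimal;

public class PrecisionDemo {
    public static void main(String[] args) {
        // 17 significant digits: one more than a double can reliably carry
        System.out.println(37.807807921694092);                   // 37.80780792169409
        System.out.println(Double.valueOf("37.807807921694092")); // 37.80780792169409
        // BigDecimal keeps every digit of the source text
        System.out.println(new BigDecimal("37.807807921694092")); // 37.807807921694092
    }
}
```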
And BTW this is a problem with the citylots file, those floats in there have more precision than is safe with JSON. So it's probably not useful for supporting numeric-match testing.
Oh wow. I did not realize that. Let me look into changing city lots to have a healthy mix of numbers with less than 15 decimals.
// if high bit is 0, we want to xor with sign bit 1 << 63, else negate (xor with ^0). Meaning,
// bits >= 0, mask = 1000000000000000000000000000000000000000000000000000000000000000
// bits < 0, mask = 1111111111111111111111111111111111111111111111111111111111111111
final long mask = ((bits >>> 63) * 0xFFFFFFFFFFFFFFFFL) | (1L << 63); |
Raph Levien proposed an even nicer way on Mastodon (not yet in Quamina afaik). (bits >> 63) | (1L << 63) is sufficient. For negative numbers, the sign bit will be set; >> is sign extending and will fill all the left bits with the sign bit. No need for another constant.
sweet. Thanks. If you have the link to the tweet (toot?) from Raph, I'd love to link it for attribution.
https://mastodon.online/@raph/113071041069390831 ... his was for Go, not Java - but it's the same idea
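A minimal Java sketch of that idea (names here are illustrative, not the PR's): the arithmetic shift builds the mask in one expression, and the transformed bits sort in numeric order under unsigned comparison.

```java
static long comparableBits(double d) {
    final long bits = Double.doubleToRawLongBits(d);
    // bits >> 63 sign-extends: all ones for negative doubles (invert everything),
    // all zeros for positive doubles (only the OR'ed sign bit 1L << 63 gets flipped).
    return bits ^ ((bits >> 63) | (1L << 63));
}

// e.g. Long.compareUnsigned(comparableBits(-2.0), comparableBits(-1.0)) < 0
//      Long.compareUnsigned(comparableBits(-1.0), comparableBits(1.0)) < 0
```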
Issue #, if available:
179
Description of changes:
This change follows the guidance from #179 on using a 10-byte base-128 encoded format for numbers, similar to how Quamina does it.
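For readers unfamiliar with the format, a hedged sketch of what a fixed-width base-128 encoding of a long can look like (illustrative only; the method name and details are not the PR's actual implementation):

```java
static String toBase128(long value) {
    char[] out = new char[10];           // 10 digits x 7 bits = 70 bits, enough for a full long
    for (int i = 9; i >= 0; i--) {
        out[i] = (char) (value & 0x7F);  // low 7 bits become one base-128 digit
        value >>>= 7;                    // unsigned shift so negative bit patterns drain too
    }
    // fixed width, most significant digit first, so lexicographic order of the
    // resulting strings matches unsigned numeric order of the inputs
    return new String(out);
}
```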
Didn't see any performance implications of supporting the new range, but had to fix a bunch of tests. I will be changing the numbers we use for testing to better test the new range of numbers before merging.
During debugging, I found it challenging to make sense of the numbers, so I've also added a helper method in ComparableNumbers and modified toString() methods in a few places.
Benchmark / Performance (for source code changes):
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.