Improve Numeric matching to support full range of float64 #188

baldawar · 2024-09-18T04:32:51Z

Issue #, if available:

179

Description of changes:

This change follows the guidance from #179 on using 10 byte base-128 encoded format for numbers similar to how Quamina does it.

Didn't see any performance implications of supporting the new range, but had to fix a bunch of tests. I will be changing the numbers we use for testing to better test the new range of numbers before merging.

During debugging, I found it challenging to make sense of the numbers to I've also added a helper method in ComparableNumbers and modified toString() methods in few places.

Benchmark / Performance (for source code changes):

Benchmark                                                  (benchmark)   Mode  Cnt       Score       Error  Units
CityLots2JmhBenchmarks.group01Simple                             EXACT  thrpt   10  413930.986 ± 11450.753  ops/s
CityLots2JmhBenchmarks.group01Simple                          WILDCARD  thrpt   10  352480.174 ±  6750.778  ops/s
CityLots2JmhBenchmarks.group01Simple                            PREFIX  thrpt   10  418264.642 ± 19729.068  ops/s
CityLots2JmhBenchmarks.group01Simple   PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  415142.905 ± 18319.570  ops/s
CityLots2JmhBenchmarks.group01Simple                            SUFFIX  thrpt   10  419590.039 ± 15463.837  ops/s
CityLots2JmhBenchmarks.group01Simple   SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  410364.141 ± 12610.649  ops/s
CityLots2JmhBenchmarks.group01Simple                EQUALS_IGNORE_CASE  thrpt   10  382570.869 ±  5316.951  ops/s
CityLots2JmhBenchmarks.group01Simple                           NUMERIC  thrpt   10  256233.479 ±  3662.693  ops/s
CityLots2JmhBenchmarks.group01Simple                      ANYTHING-BUT  thrpt   10  276676.782 ±   958.215  ops/s
CityLots2JmhBenchmarks.group01Simple          ANYTHING-BUT-IGNORE-CASE  thrpt   10  274642.171 ±  5587.937  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-PREFIX  thrpt   10  288814.025 ±  1302.395  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-SUFFIX  thrpt   10  284559.927 ±  3258.949  ops/s
CityLots2JmhBenchmarks.group01Simple             ANYTHING-BUT-WILDCARD  thrpt   10  303269.268 ±   922.894  ops/s
CityLots2JmhBenchmarks.group02Complex                   COMPLEX_ARRAYS  thrpt    6   54274.020 ±   567.870  ops/s
CityLots2JmhBenchmarks.group02Complex                    PARTIAL_COMBO  thrpt    6  102749.040 ±  2765.021  ops/s
CityLots2JmhBenchmarks.group02Complex                            COMBO  thrpt    6   35555.689 ±   312.623  ops/s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

timbray

LGTM, a couple of nonblocking comments to look at

timbray · 2024-09-19T17:58:13Z

src/main/software/amazon/event/ruler/ComparableNumber.java

- * {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
- * results show that only 5 decimal places of precision can be guaranteed when using doubles.
+ * The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
+ * where parsing directly to {@code Double} can loose precision when parsing doubles. It's


"loose" should be "lose"

timbray · 2024-09-19T17:58:32Z

src/main/software/amazon/event/ruler/ComparableNumber.java

- * results show that only 5 decimal places of precision can be guaranteed when using doubles.
+ * The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
+ * where parsing directly to {@code Double} can loose precision when parsing doubles. It's
+ * probably possible to wider ranges with our current implementation of parsing strings to


grammar, "possible to wider ranges"

timbray · 2024-09-19T18:00:17Z

src/main/software/amazon/event/ruler/ComparableNumber.java

-            throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
-                    " and " + ComparableNumber.HALF_TRILLION + ", inclusive");
+        // make sure we have the comparable numbers and haven't eaten up decimals values
+        if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||


Can this happen? JSON doesn't allow numeric values of NaN and Inf

not in JSON but there's interest in supporting other formats so I'm bullet proofing this now than be in for a surprise in future.

You can check both in one go based on the bits-representation. With a constant long NOT_FINITE = 0x7ff0L << 52: (bits & NOT_FINITE) == NOT_FINITE will be true for only NaN and +/- Infinity.

Maybe the JVM automagically already inlines the two calls like that. If not, it could help, here.

JVM does inline it. I've left this as it for readability for now.

timbray · 2024-09-19T18:02:13Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+     * <br/>
+     * As shown in Quamina's numbits.go, it's possible to use variable length encoding
+     * to reduce storage for simple (often common) numbers but it's not done here to
+     * keep range comparisons simple for now.


Haha, ever since Quamina got variable-width numbits I've been worrying about the range comparison code. sigh now I guess I have to figure that out.

Yeah. I gave it a shot but found it hard to get right. I wanted to get back to building $and support, so took the easier route. I figured I can come back to cleaning up my branch later.

timbray · 2024-09-19T18:04:17Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+     * @param value the long value to be converted
+     * @return the Base128 encoded string representation of the input value
+     */
+    public static String numbits(long value) {


Each char in a java String is 2 bytes because they are UTF-16 codepoints. You could save a lot of memory if you could work on the UTF-8 bytes. But I realize that this comment applies to all of Ruler I guess.

that's fair. It would be worth a look when we dig into memory & cpu improvements down the line.

How are Strings-by-chars sorted in Event Ruler? Binary value or with some collation? If it's binary... you can still use the same scheme but store 2 7-bit bytes per char. Might require shuffling the bytes around if the endianness of the char 2-byte-tuples is different. But as long as leaving the 0x8080 bits out stays clear of combiners or possibly causing illegal codepoints... you should be good.

... or use the lower 15 bits. simpler than a 0x8080 mask. Still, check that you get by with only one codepoint and that it's not reserved or illegal in UTF16

Comparing ranges got trickier with using 2 7-bit bytes per char, so I've parked the idea for now. Will come back to it later.

In case anyone takes a stab at it in future, here's the reference code for this :

static final int MAX_LENGTH_IN_BYTES = 5; ... public static String numbits(long value) { char[] result = new char[MAX_LENGTH_IN_BYTES]; int index = MAX_LENGTH_IN_BYTES - 1; while (value != 0 && index >= 0) { int lowBits = (int)(value & BASE_128_BITMASK); value >>= 7; int highBits = (int)(value & BASE_128_BITMASK); value >>= 7; result[index--] = (char)((highBits << 7) | lowBits); } return new String(result); }

... Tim told me quamina internally works different. Quamina needs the UTF8 validity because it uses UTF8-illegal byte values as markers.
Is ruler different enough so you don't depend on that? In that case, you could also just cut it into 4 different shorts. Unless that destroys some string operations you need, e.g by producing hanging combiners etc.

For range comparison, Ruler has a lookup table that changes incoming numbers into a sequence of digits https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L128 . Changing this requires more effort and I suspect not as beneficial compared to supporting variable length for numbers. So I chose to work on variable length first and then come back to compressing the numbers later (though I'm hopeful that I'd be able to do both once I get more into refactoring this part of the code more)

Quamina needs the UTF8 validity because it uses UTF8-illegal byte values as markers.
Is ruler different enough so you don't depend on that

I don't know the answer for this on the top of mind. Will check on it later but I was originally suspecting it should be fine as long as I address any edge cases within the look-up table part of the code.

arnehormann

@baldawar/@timbray my 2cts as promised... ping me if you need more details

arnehormann · 2024-09-20T19:57:22Z

src/main/software/amazon/event/ruler/ComparableNumber.java

- * 1. It saves 3 bytes of memory per number compared to decimal representation.
- * 2. It is lexicographically comparable, which is useful for maintaining sorted order of numbers.
- * 2. It aligns with the radix used for IP addresses.
+ * We use Base128 encoding offers a compact representation of decimal numbers


s/We use/The used/

arnehormann · 2024-09-20T20:04:08Z

src/main/software/amazon/event/ruler/ComparableNumber.java

- * other data types. The higher the number, the lower the accuracy that can be maintained. For example,
- * {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
- * results show that only 5 decimal places of precision can be guaranteed when using doubles.
+ * The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue


Can you provide a reference or details for that?

Double handling is finicky and has lots of potential foot guns. But I never ran into a loss of precision that's not mediated by decimal-binary conversion issues. If you e.g. use Double.toHexString and the "p" representation of exponents in your numerical tests, you should be able to avoid these issues.
This only applies to where you want to use numbers and not the strings, of course. But at some point, the JSON number strings will be converted to double and you'll run into the same issue.
Unless the JSON parsers also avoid Double. But then, you should use the same code path the JSON parser uses.

You won't lose fidelity. But you'll gain some performance.

I noticed the issue with37.807807921694092 when parsing Doubles from strings (so Double.valueOf(...) and when doing double y = 37.807807921694092. Both return 37.80780792169409, so its probably tied to the decimal-binary conversion problems. I wasn't sure that JSON number strings will have the same issue or not as some systems using favor direct string transformations and bypass JSON parsing altogether.

A double represented as a decimal stops being accurate after 15 digits (unless it ends in 5). this number has 17.
A good way to keep it in mind: 10 bits fit 3 decimal digits (1000 < 1024). a double has 52 mantissa bits, so this buys you a fractional amount more than 15 digits with 50 bits. the other 2 (or 3 if you include the implicit one) are not enough for a 16th digit. So it will... distribute... the rest - a fraction more than 8 alternatives - over 10 candidate digits. As powers of two are binary, a number ending in 5 will be at an advantage, here.

And BTW this is a problem with the citylots file, those floats in there have more precision than is safe with JSON. So it's probably not useful for supporting numeric-match testing.

Oh wow. I did not realize that. Let me look into changing city lots to have a healthy mix of numbers with less than 15 decimals.

arnehormann · 2024-09-20T20:10:12Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+        // if high bit is 0, we want to xor with sign bit 1 << 63, else negate (xor with ^0). Meaning,
+        // bits >= 0, mask = 1000000000000000000000000000000000000000000000000000000000000000
+        // bits < 0,  mask = 1111111111111111111111111111111111111111111111111111111111111111
+        final  long mask = ((bits >>> 63) * 0xFFFFFFFFFFFFFFFFL) | (1L  << 63);


Raph Levien proposed an even nicer way on Mastodon (not yet in Quamina afaik).
(bits >> 63) | (1L << 63) is sufficient. For negative numbers, the sign bit will be set. >> is sign extending and will fill all the left bits with the sign bit. No need for another constant.

sweet. Thanks. If you have the link to the tweet (toot?) from Raph, I'd love to link it for attribution.

https://mastodon.online/@raph/113071041069390831 ... his was for Go, not Java - but it's the same idea

arnehormann · 2024-09-20T20:18:32Z

src/main/software/amazon/event/ruler/ComparableNumber.java

+     * @param value the long value to be converted
+     * @return the Base128 encoded string representation of the input value
+     */
+    public static String numbits(long value) {


How are Strings-by-chars sorted in Event Ruler? Binary value or with some collation? If it's binary... you can still use the same scheme but store 2 7-bit bytes per char. Might require shuffling the bytes around if the endianness of the char 2-byte-tuples is different. But as long as leaving the 0x8080 bits out stays clear of combiners or possibly causing illegal codepoints... you should be good.

arnehormann · 2024-09-20T20:27:21Z

src/main/software/amazon/event/ruler/ComparableNumber.java

-            throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
-                    " and " + ComparableNumber.HALF_TRILLION + ", inclusive");
+        // make sure we have the comparable numbers and haven't eaten up decimals values
+        if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||


You can check both in one go based on the bits-representation. With a constant long NOT_FINITE = 0x7ff0L << 52: (bits & NOT_FINITE) == NOT_FINITE will be true for only NaN and +/- Infinity.

Maybe the JVM automagically already inlines the two calls like that. If not, it could help, here.

baldawar added 5 commits September 17, 2024 21:04

Initial commit for supporting float64 numbers

9587870

rev2 based on pull request diff

687efb2

rev3 : undo bad rebase

184967e

checkstyle errors

594b03b

removing flaky test cases

252d9ab

baldawar marked this pull request as ready for review September 19, 2024 00:38

timbray approved these changes Sep 19, 2024

View reviewed changes

minor javadoc fixes

e671aef

baldawar merged commit 1406b64 into main Sep 19, 2024
4 checks passed

arnehormann reviewed Sep 20, 2024

View reviewed changes

baldawar mentioned this pull request Sep 26, 2024

Minor updates to ComparableNumbers based on PR feedback. #190

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Numeric matching to support full range of float64 #188

Improve Numeric matching to support full range of float64 #188

baldawar commented Sep 18, 2024 •

edited

Loading

timbray left a comment

timbray Sep 19, 2024

timbray Sep 19, 2024

timbray Sep 19, 2024

baldawar Sep 19, 2024

arnehormann Sep 20, 2024

baldawar Sep 26, 2024

timbray Sep 19, 2024

baldawar Sep 19, 2024

timbray Sep 19, 2024

baldawar Sep 19, 2024

arnehormann Sep 20, 2024

arnehormann Sep 21, 2024

baldawar Sep 26, 2024

arnehormann Sep 27, 2024

baldawar Sep 27, 2024

baldawar Sep 27, 2024

arnehormann left a comment

arnehormann Sep 20, 2024

arnehormann Sep 20, 2024

baldawar Sep 26, 2024 •

edited

Loading

arnehormann Sep 27, 2024 •

edited

Loading

timbray Sep 27, 2024

baldawar Sep 27, 2024

arnehormann Sep 20, 2024

baldawar Sep 26, 2024

arnehormann Sep 26, 2024 •

edited

Loading

arnehormann Sep 20, 2024

arnehormann Sep 20, 2024

Improve Numeric matching to support full range of float64 #188

Improve Numeric matching to support full range of float64 #188

Conversation

baldawar commented Sep 18, 2024 • edited Loading

Issue #, if available:

Description of changes:

Benchmark / Performance (for source code changes):

timbray left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arnehormann left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

baldawar Sep 26, 2024 • edited Loading

Choose a reason for hiding this comment

arnehormann Sep 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arnehormann Sep 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

baldawar commented Sep 18, 2024 •

edited

Loading

baldawar Sep 26, 2024 •

edited

Loading

arnehormann Sep 27, 2024 •

edited

Loading

arnehormann Sep 26, 2024 •

edited

Loading