Improve Numeric matching to support full range of float64 (#188)

This change follows the guidance from #179 on using a 10-byte base-128 encoded format for numbers, similar to how Quamina does it. I didn't see any performance implications of supporting the new range, but had to fix a bunch of tests. I will be changing the numbers we use for testing to better exercise the new range before merging. During debugging, I found it challenging to make sense of the numbers, so I've also added a helper method in ComparableNumber and modified toString() methods in a few places.
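
To make the encoding idea concrete, here is a minimal, standalone sketch of an order-preserving 10-byte base-128 key for a `double`. It is an illustration only, not the code in this commit: the class and method names (`NumbitsSketch`, `toOrderedBits`, `encode`, `compare`) are made up for the example, and it uses an unsigned shift, so individual byte values may differ from what `ComparableNumber.numbits` emits.

```java
import java.util.Arrays;

public class NumbitsSketch {
    static final int LENGTH = 10;          // 10 groups of 7 bits cover all 64 bits
    static final int MASK_7_BITS = 0x7f;

    // Maps a finite double to a long whose unsigned order matches numeric order.
    static long toOrderedBits(double d) {
        long bits = Double.doubleToRawLongBits(d);
        // Non-negative values: flip only the sign bit. Negative values: flip every bit.
        long mask = (bits < 0) ? -1L : Long.MIN_VALUE;
        return bits ^ mask;
    }

    // Splits the 64 ordered bits into 7-bit groups, most significant group first.
    static byte[] encode(double d) {
        long value = toOrderedBits(d);
        byte[] out = new byte[LENGTH];
        for (int i = LENGTH - 1; i >= 0; i--) {
            out[i] = (byte) (value & MASK_7_BITS);
            value >>>= 7;                  // unsigned shift (an assumption of this sketch)
        }
        return out;
    }

    // Plain lexicographic comparison; every byte is in 0..127, so signed compare is safe.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                return Byte.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        double[] ascending = {
            -Double.MAX_VALUE, -5e11, -1.5, 0.0, Double.MIN_VALUE, 1.5, 5e11, Double.MAX_VALUE
        };
        for (int i = 0; i + 1 < ascending.length; i++) {
            if (compare(encode(ascending[i]), encode(ascending[i + 1])) >= 0) {
                throw new AssertionError("order not preserved at index " + i);
            }
        }
        System.out.println("byte order matches numeric order");
        System.out.println(Arrays.toString(encode(1.0)));
    }
}
```

Because every key has the same fixed length, a range check reduces to a byte-by-byte comparison, which is the property the range matcher relies on.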
aws · Sep 19, 2024 · 1406b64
1 parent c5dc202
Showing 14 changed files with 239 additions and 276 deletions.
diff --git a/README.md b/README.md
@@ -273,10 +273,8 @@ Anything-but wildcard list (strings):
 ```
 
 Above, the references to `c-count`, `d-count`, and `x-limit` illustrate numeric matching,
-and only
-work with values that are JSON numbers.  Numeric matching is limited to value between
--5.0e11 and +5.0e11 inclusive, with 17 digits of precision, that is to say 6 digits
-to the right of the decimal point.
+and only work with values that are JSON numbers.  Numeric matching supports the same
+precision and range as Java's `double` primitive, which implements the IEEE 754 `binary64` standard.
 
 ### IP Address Matching
 ```javascript

diff --git a/pom.xml b/pom.xml
@@ -20,7 +20,7 @@
   <groupId>software.amazon.event.ruler</groupId>
   <artifactId>event-ruler</artifactId>
   <name>Event Ruler</name>
-  <version>1.7.6</version>
+  <version>1.8.0</version>
   <description>Event Ruler is a Java library that allows matching Rules to Events. An event is a list of fields,
     which may be given as name/value pairs or as a JSON object. A rule associates event field names with lists of
     possible values. There are two reasons to use Ruler: 1/ It's fast; the time it takes to match Events doesn't

diff --git a/src/main/software/amazon/event/ruler/ComparableNumber.java b/src/main/software/amazon/event/ruler/ComparableNumber.java
@@ -3,104 +3,122 @@
 import ch.randelshofer.fastdoubleparser.JavaBigDecimalParser;
 
 import java.math.BigDecimal;
-
-import static software.amazon.event.ruler.Constants.BASE64_DIGITS;
-import static software.amazon.event.ruler.Constants.MIN_NUM_DIGIT;
+import java.nio.charset.StandardCharsets;
+import java.util.Arrays;
+import java.util.List;
 
 /**
  * Represents a number as a comparable string.
  * <br/>
- * Numbers are allowed in the range -500,000,000,000 to +500,000,000,000 (inclusive).
- * Comparisons are precise to 17 decimal places, with six to the right of the decimal.
- * Numbers are treated as floating-point values.
- * <br>
- * Numbers are converted to strings by:
- * 1. Multiplying by 1,000,000 to remove the decimal point and then adding 500,000,000,000 (to remove negatives), then
- * 2. Formatting to a 12-character to base64 string with padding, because the base64 string
- *     converted from 500,000,000,000 * 1,000,000 = 500,000,000,000,000,000 has 12 characters.
+ * All possible double numbers (IEEE-754 binary64 standard) are allowed.
+ * Numbers are first standardized to floating-point values and then converted
+ * to a Base128 encoded string of 10 bytes.
  * <br/>
- * Hexadecimal representation is used because:
- * 1. It saves 3 bytes of memory per number compared to decimal representation.
- * 2. It is lexicographically comparable, which is useful for maintaining sorted order of numbers.
- * 2. It aligns with the radix used for IP addresses.
+ * We use Base128 encoding because it offers a compact representation of decimal numbers
+ * that preserves the lexicographical order of the numbers. See
+ * https://github.com/aws/event-ruler/issues/179 for more context.
  * <br/>
- * The number is parsed as a Java {@code BigDecimal} to support decimal fractions. We're avoiding double as
- * there is a well-known issue that double numbers can lose precision when performing calculations involving
- * other data types. The higher the number, the lower the accuracy that can be maintained. For example,
- * {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
- * results show that only 5 decimal places of precision can be guaranteed when using doubles.
+ * The numbers are first parsed as a Java {@code BigDecimal} as there is a well-known issue
+ * where parsing directly to {@code Double} can silently lose precision. It's
+ * probably possible to support wider ranges with our current implementation of parsing strings to
+ * BigDecimal, but it's not worth the effort as JSON also supports up to the float64 range. In
+ * case this requirement changes, it would be advisable to move away from using {@code Double}
+ * and {@code Long} in this class.
  * <br/>
  * CAVEAT:
- * The current range of +/- 500,000,000,000 is selected as a balance between maintaining the committed 6
- * decimal places of precision and memory cost (each number is parsed into a 12-character hexadecimal string).
+ * There are precision and memory implications of the implementation here.
  * When trying to increase the maximum number, PLEASE BE VERY CAREFUL TO PRESERVE THE NUMBER PRECISION AND
  * CONSIDER THE MEMORY COST.
- * <br/>
- * Also, while {@code BigDecimal} can ensure the precision of double calculations, it has been shown to be
- * 2-4 times slower for basic mathematical and comparison operations, so we turn to long integer arithmetic.
- * This will need to be modified if we ever need to support larger numbers.
  */
 class ComparableNumber {
-    // Use scientific notation to define the double number directly to avoid losing Precision by calculation
-    // for example 5000 * 1000 *1000 will be wrongly parsed as 7.05032704E8 by computer.
-    static final double HALF_TRILLION = 5E11;
 
-    static final int MAX_LENGTH_IN_BYTES = 16;
-    static final int MAX_DECIMAL_PRECISON = 6;
+    static final int MAX_LENGTH_IN_BYTES = 10;
+    static final int BASE_128_BITMASK = 0x7f; // 127 or 01111111
 
-    public static final BigDecimal TEN_E_SIX = new BigDecimal("1E6"); // to remove decimals
-    public static final long HALF_TRILLION_TEN_E_SIX = new BigDecimal(ComparableNumber.HALF_TRILLION).multiply(TEN_E_SIX).longValueExact();
-
-    private ComparableNumber() {
-    }
+    private ComparableNumber() {}
 
     /**
-     * Generates a hexadecimal string representation of a given decimal string value,
-     * with a maximum precision of 6 decimal places and a range between -5,000,000,000
-     * and 5,000,000,000 (inclusive).
+     * Generates a comparable number string from a given string representation
+     * using numbits representation.
      *
-     * @param str the decimal string value to be converted
-     * @return the hexadecimal string representation of the input value
-     * @throws IllegalArgumentException if the input value has more than 6 decimal places
-     *                                  or is outside the allowed range
+     * @param str the string representation of the number
+     * @return the comparable number string
+     * @throws NumberFormatException if the input isn't a number
+     * @throws IllegalArgumentException if the input isn't a number we can compare
      */
     static String generate(final String str) {
-        final BigDecimal number = JavaBigDecimalParser.parseBigDecimal(str).stripTrailingZeros();
-        if (number.scale() > MAX_DECIMAL_PRECISON) {
-            throw new IllegalArgumentException("Only values upto 6 decimals are supported");
-        }
-
-        final long shiftedBySixDecimals = number.multiply(TEN_E_SIX).longValueExact();
+        final BigDecimal bigDecimal = JavaBigDecimalParser.parseBigDecimal(str);
+        final double doubleValue = bigDecimal.doubleValue();
 
-        // faster than doing bigDecimal comparisons
-        if (shiftedBySixDecimals < -HALF_TRILLION_TEN_E_SIX || shiftedBySixDecimals > HALF_TRILLION_TEN_E_SIX) {
-            throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
-                    " and " + ComparableNumber.HALF_TRILLION + ", inclusive");
+        // make sure the number is comparable and that no decimal digits were lost in the conversion
+        if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||
+                BigDecimal.valueOf(doubleValue).compareTo(bigDecimal) != 0) {
+            throw new IllegalArgumentException("Cannot compare number : " + str);
         }
+        final long bits = Double.doubleToRawLongBits(doubleValue);
 
-        return longToBase64Bytes(shiftedBySixDecimals + HALF_TRILLION_TEN_E_SIX);
+        // if high bit is 0, we want to xor with sign bit 1 << 63, else negate (xor with ^0). Meaning,
+        // bits >= 0, mask = 1000000000000000000000000000000000000000000000000000000000000000
+        // bits < 0,  mask = 1111111111111111111111111111111111111111111111111111111111111111
+        final  long mask = ((bits >>> 63) * 0xFFFFFFFFFFFFFFFFL) | (1L  << 63);
+        return numbits(bits ^ mask );
     }
 
-    public static String longToBase64Bytes(long value) {
-        if (value < 0) {
-            throw new IllegalArgumentException("Input value must be non-negative");
+    /**
+     * Converts a long value to a Base128 encoded string representation.
+     * <br/>
+     * The Base128 encoding scheme is a way to represent a long value as a sequence
+     * of bytes, where each byte encodes 7 bits of the original value. This allows for
+     * efficient storage and transmission of large numbers.
+     * <br/>
+     * The method first determines the number of trailing zero bytes in the input
+     * value by iterating over its 7-bit groups from the least significant group upwards,
+     * and counting the number of consecutive zero groups at the end.
+     * It then creates a byte array of fixed length {@code MAX_LENGTH_IN_BYTES} and
+     * populates it with the remaining Base128 encoded bytes of the input value, starting from
+     * the least significant non-zero group.
+     * <br/>
+     * As shown in Quamina's numbits.go, it's possible to use variable length encoding
+     * to reduce storage for simple (often common) numbers but it's not done here to
+     * keep range comparisons simple for now.
+     *
+     * @param value the long value to be converted
+     * @return the Base128 encoded string representation of the input value
+     */
+    public static String numbits(long value) {
+        int trailingZeroes = 0;
+        int index;
+        // Count the number of trailing zero bytes to skip setting them
+        for(index = MAX_LENGTH_IN_BYTES - 1; index >= 0; index--) {
+            if((value & BASE_128_BITMASK) != 0) {
+                break;
+            }
+            trailingZeroes ++;
+            value >>= 7;
         }
 
-        char[] bytes = new char[12]; // Maximum length of base-64 encoded long is 12 bytes
-        int index = 11;
-
-        while (value > 0) {
-            int digit = (int) (value & 0x3F); // Get the lowest 6 bits
-            bytes[index--] = (char) BASE64_DIGITS[digit];
-            value >>= 6; // Shift the value right by 6 bits
-        }
+        byte[] result = new byte[MAX_LENGTH_IN_BYTES];
 
-        while(index >= 0) { // left padding
-            bytes[index--] = (char) MIN_NUM_DIGIT;
+        // Populate the byte array with the Base128 encoded bytes of the input value
+        for(; index >= 0; index--) {
+            result[index] = (byte)(value & BASE_128_BITMASK);
+            value >>= 7;
         }
 
-        return new String(bytes);
+        return new String(result, StandardCharsets.UTF_8);
     }
 
+    /**
+     * This is a utility function for debugging and tests.
+     * Converts a given string into a list of integers, where each integer represents
+     * the ASCII value of the corresponding character in the string.
+     */
+    static List<Integer> toIntVals(String s) {
+        Integer[] arr = new Integer[s.length()];
+        for (int i=0; i<s.length(); i++) {
+            arr[i] = (int)s.charAt(i);
+        }
+        return Arrays.asList(arr);
+    }
 }
 
diff --git a/src/main/software/amazon/event/ruler/Constants.java b/src/main/software/amazon/event/ruler/Constants.java
@@ -38,18 +38,16 @@ private Constants() {
   final static byte MAX_HEX_DIGIT = HEX_DIGITS[HEX_DIGITS.length - 1]; // F
   final static byte MIN_HEX_DIGIT = HEX_DIGITS[0]; // 0
 
-  static final byte[] BASE64_DIGITS = {
-          // numbers are ordered intentionally to based on ascii table value
-          '+', '/',
-          '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
-          'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
-          'W', 'X', 'Y', 'Z',
-          'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
-          'w', 'x', 'y', 'z',
-  };
+  static final byte[] BASE128_DIGITS = new byte[128];
+
+  static {
+    for (int i = 0; i < BASE128_DIGITS.length; i++) {
+      BASE128_DIGITS[i] = (byte) i;
+    }
+  }
 
-  final static byte MAX_NUM_DIGIT = BASE64_DIGITS[BASE64_DIGITS.length - 1]; // z
-  final static byte MIN_NUM_DIGIT = BASE64_DIGITS[0]; // +
+  final static byte MAX_NUM_DIGIT = BASE128_DIGITS[BASE128_DIGITS.length - 1];
+  final static byte MIN_NUM_DIGIT = BASE128_DIGITS[0];
 
   final static List<String> RESERVED_FIELD_NAMES_IN_OR_RELATIONSHIP = Arrays.asList(
       EXACT_MATCH,

diff --git a/src/main/software/amazon/event/ruler/Range.java b/src/main/software/amazon/event/ruler/Range.java
@@ -4,7 +4,7 @@
 import java.util.Arrays;
 import java.util.function.Function;
 
-import static software.amazon.event.ruler.Constants.BASE64_DIGITS;
+import static software.amazon.event.ruler.Constants.BASE128_DIGITS;
 import static software.amazon.event.ruler.Constants.HEX_DIGITS;
 import static software.amazon.event.ruler.Constants.MAX_HEX_DIGIT;
 import static software.amazon.event.ruler.Constants.MAX_NUM_DIGIT;
@@ -17,8 +17,8 @@
  *  implementation, the number of digits in the top and bottom of the range is the same.
  */
 public final class Range extends Patterns {
-    private static final byte[] NEGATIVE_HALF_TRILLION_BYTES = doubleToComparableBytes(-ComparableNumber.HALF_TRILLION);
-    private static final byte[] POSITIVE_HALF_TRILLION_BYTES = doubleToComparableBytes(ComparableNumber.HALF_TRILLION);
+    private static final byte[] MIN_RANGE_BYTES = doubleToComparableBytes(-Double.MAX_VALUE);
+    private static final byte[] MAX_RANGE_BYTES = doubleToComparableBytes(Double.MAX_VALUE);
     private static final int HEX_DIGIT_A_DECIMAL_VALUE = 10;
     /**
      * Bottom and top of the range. openBottom true means we're looking for > bottom, false means >=
@@ -51,22 +51,22 @@ private Range(Range range) {
 
     public static Range lessThan(final String val) {
         byte[] byteVal = stringToComparableBytes(val);
-        return between(NEGATIVE_HALF_TRILLION_BYTES, false, byteVal, true);
+        return between(MIN_RANGE_BYTES, false, byteVal, true);
     }
 
     public static Range lessThanOrEqualTo(final String val) {
         byte[] byteVal = stringToComparableBytes(val);
-        return between(NEGATIVE_HALF_TRILLION_BYTES, false, byteVal, false);
+        return between(MIN_RANGE_BYTES, false, byteVal, false);
     }
 
     public static Range greaterThan(final String val) {
         byte[] byteVal = stringToComparableBytes(val);
-        return between(byteVal, true, POSITIVE_HALF_TRILLION_BYTES, false);
+        return between(byteVal, true, MAX_RANGE_BYTES, false);
     }
 
     public static Range greaterThanOrEqualTo(final String val) {
         byte[] byteVal = stringToComparableBytes(val);
-        return between(byteVal, false, POSITIVE_HALF_TRILLION_BYTES, false);
+        return between(byteVal, false, MAX_RANGE_BYTES, false);
     }
 
     public static Range between(final String bottom, final boolean openBottom, final String top, final boolean openTop) {
@@ -128,7 +128,7 @@ public byte minDigit() {
     static byte[] digitSequence(byte first, byte last, boolean includeFirst, boolean includeLast, boolean isCIDR) {
         return isCIDR ?
                 digitSequence(first, last, includeFirst, includeLast, HEX_DIGITS, Range::getHexByteIndex) :
-                digitSequence(first, last, includeFirst, includeLast, BASE64_DIGITS, Range::getNumByteIndex);
+                digitSequence(first, last, includeFirst, includeLast, BASE128_DIGITS, Integer::new);
     }
 
     private static byte[] digitSequence(byte first, byte last, boolean includeFirst, boolean includeLast,
@@ -157,28 +157,6 @@ private static byte[] digitSequence(byte first, byte last, boolean includeFirst,
         return bytes;
     }
 
-    // quickly find the index of chars within Constants.BASE64_DIGITS
-    private static int getNumByteIndex(byte value) {
-        if(value == '+') {
-            return 0;
-        }
-        if (value == '/') {
-            return 1;
-        }
-        // ['0'-'9'] maps to [2 - 11] indexes
-        if (value >= '0' && value <= '9') {
-            return value - '0' + 2;
-        }
-
-        // ['A'-'Z'] maps to [12-37] indexes
-        if (value >= 'A' && value <= 'Z') {
-            return (value - 'A') + 12;
-        }
-
-        // ['a'-'z'] maps to [38-64] indexes
-        return (value - 'a') + 38;
-    }
-
     // quickly find the index of chars within Constants.HEX_DIGITS
     private static int getHexByteIndex(byte value) {
         // ['0'-'9'] maps to [0-9] indexes
@@ -226,8 +204,15 @@ public int hashCode() {
     }
 
     public String toString() {
-        return (new String(bottom, StandardCharsets.UTF_8)) + '/' + (new String(top, StandardCharsets.UTF_8))
-                       + ':' + openBottom + '/' + openTop + ':' + isCIDR +" (" + super.toString() + ")";
+        if(isCIDR) {
+            return (new String(bottom, StandardCharsets.UTF_8)) + '/' + (new String(top, StandardCharsets.UTF_8))
+                    + ':' + openBottom + '/' + openTop + ':' + isCIDR + " (" + super.toString() + ")";
+        } else {
+            return "" +
+                    ComparableNumber.toIntVals(new String(bottom, StandardCharsets.UTF_8)) + '/' +
+                    ComparableNumber.toIntVals(new String(top, StandardCharsets.UTF_8))
+                    + ':' + openBottom + '/' + openTop + ':' + isCIDR + " (" + super.toString() + ")";
+        }
     }
 
     private static byte[] doubleToComparableBytes(double d) {

diff --git a/src/main/software/amazon/event/ruler/ValuePatterns.java b/src/main/software/amazon/event/ruler/ValuePatterns.java
@@ -45,6 +45,10 @@ public int hashCode() {
 
     @Override
     public String toString() {
-        return "VP:" + pattern + " (" + super.toString() + ")";
+        if(type() == MatchType.NUMERIC_EQ) {
+            return "VP:" + ComparableNumber.toIntVals(pattern) + " (" + super.toString() + ")";
+        } else {
+            return "VP:" + pattern + " (" + super.toString() + ")";
+        }
     }
 }
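
The guard added to `ComparableNumber.generate` rejects any input that does not survive a round trip through `double`. The sketch below isolates that check so its behaviour is easier to see. It is a hedged illustration rather than the library code: `ExactDoubleCheck` and `isExactlyComparable` are hypothetical names, and `new BigDecimal(str)` stands in for `JavaBigDecimalParser.parseBigDecimal(str)`.

```java
import java.math.BigDecimal;

public class ExactDoubleCheck {

    // Returns true when the decimal string is finite and round-trips through double
    // without losing digits, mirroring the condition used in the diff above.
    static boolean isExactlyComparable(String str) {
        BigDecimal parsed = new BigDecimal(str);   // stand-in for the fast parser (assumption)
        double d = parsed.doubleValue();
        if (Double.isNaN(d) || Double.isInfinite(d)) {
            return false;                          // outside the float64 range
        }
        // BigDecimal.valueOf goes through Double.toString, i.e. the shortest decimal
        // string that uniquely identifies the double.
        return BigDecimal.valueOf(d).compareTo(parsed) == 0;
    }

    public static void main(String[] args) {
        String[] samples = {
            "0.1",                      // shortest representation of a double
            "5e11",                     // the old upper bound
            "0.30000000000000000001",   // more digits than a double can hold
            "9007199254740993",         // 2^53 + 1, not representable as a double
            "1e400"                     // overflows to infinity
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isExactlyComparable(s));
        }
    }
}
```

Under this check a value such as `0.1` is accepted even though its binary form is inexact, because the comparison is against the shortest decimal string of the double rather than its exact binary expansion.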